Preface


Technological advances in artificial intelligence (AI) are leading the rapidly changing 
world of the twenty-first century. We have already passed from machine learning to 
deep learning with numerous applications. The contribution of AI so far to the 
improvement of our quality of life is profound. Major challenges but also risks and 
threats are here. Brain-inspired computing explores, simulates, and imitates the struc- 
ture and the function of the human brain, achieving high-performance modeling plus 
visualization capabilities. 

The International Conference on Artificial Neural Networks (ICANN) is the annual 
flagship conference of the European Neural Network Society (ENNS). It features the 
main tracks “Brain-Inspired Computing” and “Machine Learning Research,” with 
strong cross-disciplinary interactions and applications. All research fields dealing with 
neural networks are present. 

The 27th ICANN was held during October 4-7, 2018, at the Aldemar Amilia Mare
five-star resort and conference center in Rhodes, Greece. The previous ICANN events 
were held in Helsinki, Finland (1991), Brighton, UK (1992), Amsterdam, The 
Netherlands (1993), Sorrento, Italy (1994), Paris, France (1995), Bochum, Germany 
(1996), Lausanne, Switzerland (1997), Skövde, Sweden (1998), Edinburgh, UK
(1999), Como, Italy (2000), Vienna, Austria (2001), Madrid, Spain (2002), Istanbul, 
Turkey (2003), Budapest, Hungary (2004), Warsaw, Poland (2005), Athens, Greece 
(2006), Porto, Portugal (2007), Prague, Czech Republic (2008), Limassol, Cyprus 
(2009), Thessaloniki, Greece (2010), Espoo-Helsinki, Finland (2011), Lausanne, 
Switzerland (2012), Sofia, Bulgaria (2013), Hamburg, Germany (2014), Barcelona, 
Spain (2016), and Alghero, Italy (2017). 

Following a long-standing tradition, these Springer volumes belong to the Lecture 
Notes in Computer Science Springer series. They contain the papers that were accepted 
to be presented orally or as posters during the 27th ICANN conference. The 27th 
ICANN Program Committee was delighted by the overwhelming response to the call 
for papers. All papers went through a peer-review process by at least two and many 
times by three or four independent academic referees to resolve any conflicts. In total, 
360 papers were submitted to the 27th ICANN. Of these, 139 (38.3%) were accepted as 
full papers for oral presentation of 20 minutes with a maximum length of 10 pages, 
whereas 28 of them were accepted as short contributions to be presented orally in 15 
minutes and for inclusion in the proceedings with 8 pages. Also, 41 papers (11.4%) 
were accepted as full papers for poster presentation (up to 10 pages long), whereas 11 
were accepted as short papers for poster presentation (maximum length of 8 pages). 

The accepted papers of the 27th ICANN conference are related to the following 
thematic topics: 


AI and Bioinformatics
Bayesian and Echo State Networks
Brain-Inspired Computing
Chaotic Complex Models
Clustering, Mining, Exploratory Analysis
Coding Architectures
Complex Firing Patterns
Convolutional Neural Networks
Deep Learning (DL)
  - DL in Real Time Systems
  - DL and Big Data Analytics
  - DL and Big Data
  - DL and Forensics
  - DL and Cybersecurity
  - DL and Social Networks
Evolving Systems - Optimization
Extreme Learning Machines
From Neurons to Neuromorphism
From Sensation to Perception
From Single Neurons to Networks
Fuzzy Modeling
Hierarchical ANN
Inference and Recognition
Information and Optimization
Interacting with the Brain
Machine Learning (ML)
  - ML for Bio-Medical Systems
  - ML and Video-Image Processing
  - ML and Forensics
  - ML and Cybersecurity
  - ML and Social Media
  - ML in Engineering
Movement and Motion Detection
Multilayer Perceptrons and Kernel Networks
Natural Language
Object and Face Recognition
Recurrent Neural Networks and Reservoir Computing
Reinforcement Learning
Reservoir Computing
Self-Organizing Maps
Spiking Dynamics/Spiking ANN
Support Vector Machines
Swarm Intelligence and Decision-Making
Text Mining
Theoretical Neural Computation
Time Series and Forecasting
Training and Learning



The authors of submitted papers came from 34 different countries from all over the 
globe, namely: Belgium, Brazil, Bulgaria, Canada, China, Czech Republic, Cyprus, 
Egypt, Finland, France, Germany, Greece, India, Iran, Ireland, Israel, Italy, Japan, 
Luxembourg, The Netherlands, Norway, Oman, Pakistan, Poland, Portugal, Romania, 
Russia, Slovakia, Spain, Switzerland, Tunisia, Turkey, UK, USA. 

Four keynote speakers were invited, and they gave lectures on timely aspects of AI.

We hope that these proceedings will help researchers worldwide to understand and 
to be aware of timely evolutions in AI and more specifically in artificial neural net- 
works. We believe that they will be of major interest for scientists over the globe and 
that they will stimulate further research. 
Keynote Talks


Cognitive Phase Transitions in the Cerebral 
Cortex - John Taylor Memorial Lecture 


Robert Kozma 


University of Massachusetts Amherst 


Abstract. Everyday subjective experience of the stream of consciousness sug- 
gests continuous cognitive processing in time and smooth underlying brain 
dynamics. Brain monitoring techniques with markedly improved spatio- 
temporal resolution, however, show that relatively smooth periods in brain 
dynamics are frequently interrupted by sudden changes and intermittent dis- 
continuities, evidencing singularities. There are frequent transitions between 
periods of large-scale synchronization and intermittent desynchronization at 
alpha-theta rates. These observations support the hypothesis about the cinematic 
model of cognitive processing, according to which higher cognition can be 
viewed as multiple movies superimposed in time and space. The metastable 
spatial patterns of field potentials manifest the frames, and the rapid transitions 
provide the shutter from each pattern to the next. Recent experimental evidence 
indicates that the observed discontinuities are not merely important aspects of 
cognition; they are key attributes of intelligent behavior representing the cog- 
nitive “Aha” moment of sudden insight and deep understanding in humans and 
animals. The discontinuities can be characterized as phase transitions in graphs 
and networks. We introduce computational models to implement these insights 
in a new generation of devices with robust artificial intelligence, including 
oscillatory neuromorphic memories, and self-developing autonomous robots. 


On the Deep Learning Revolution 
in Computer Vision 


Nathan Netanyahu 


Bar-Ilan University, Israel 


Abstract. Computer Vision (CV) is an interdisciplinary field of Artificial 
Intelligence (AI), which is concerned with the embedding of human visual
capabilities in a computerized system. The main thrust, essentially, of CV is to 
generate an “intelligent” high-level description of the world for a given scene, 
such that when interfaced with other thought processes can elicit, ultimately, 
appropriate action. In this talk we will review several central CV tasks and 
traditional approaches taken for handling these tasks for over 50 years. Noting 
the limited performance of standard methods applied, we briefly survey the 
evolution of artificial neural networks (ANN) during this extended period, and 
focus, specifically, on the ongoing revolutionary performance of deep learning 
(DL) techniques for the above CV tasks during the past few years. In particular, 
we provide also an overview of our DL activities, in the context of CV, at 
Bar-Ilan University. Finally, we discuss future research and development 
challenges in CV in light of further employment of prospective DL innovations. 


From Machine Learning to Machine 
Diagnostics 


Marios Polycarpou 


University of Cyprus 


Abstract. During the last few years, there has been remarkable progress in
utilizing machine learning methods in several applications that benefit from
deriving useful patterns among large volumes of data. These advances have
attracted significant attention from industry due to the prospect of reducing
the cost of predicting future events and making intelligent decisions based on 
data from past experiences. In this context, a key area that can benefit greatly 
from the use of machine learning is the task of detecting and diagnosing 
abnormal behaviour in dynamical systems, especially in safety-critical, 
large-scale applications. The goal of this presentation is to provide insight into 
the problem of detecting, isolating and self-correcting abnormal or faulty 
behaviour in large-scale dynamical systems, to present some design method- 
ologies based on machine learning and to show some illustrative examples. The 
ultimate goal is to develop the foundation of the concept of machine diagnostics, 
which would empower smart software algorithms to continuously monitor the 
health of dynamical systems during the lifetime of their operation. 


Multimodal Deep Learning in Biomedical 
Image Analysis 


Sotirios Tsaftaris 


University of Edinburgh, UK 


Abstract. Nowadays images are typically accompanied by additional informa- 
tion. At the same time, for example, magnetic resonance imaging exams typi- 
cally contain more than one image modality: they show the same anatomy under 
different acquisition strategies revealing various pathophysiological information. 
The detection of disease, segmentation of anatomy and other classical analysis 
tasks, can benefit from a multimodal view to analysis that leverages shared 
information across the sources yet preserves unique information. It is without 
surprise that radiologists analyze data in this fashion, reviewing the exam as a 
whole. Yet, when aiming to automate analysis tasks, we still treat different 
image modalities in isolation and tend to ignore additional information. In this 
talk, I will present recent work in learning with deep neural networks, latent 
embeddings suitable for multimodal processing, and highlight opportunities and 
challenges in this area. 
Abstract. The heavy storage and computational overheads have
become a hindrance to the deployment of modern Convolutional Neural 
Networks (CNNs). To overcome this drawback, many works have been 
proposed to exploit redundancy within CNNs. However, most of them 
work as post-training processes. They start from pre-trained dense mod- 
els and apply compression and extra fine-tuning. The overall process is 
time-consuming. In this paper, we introduce redundancy-aware training, 
an approach to learn sparse CNNs from scratch with no need for any 
post-training compression procedure. In addition to minimizing training 
loss, redundancy-aware training prunes unimportant weights for sparse 
structures in the training phase. To ensure stability, a stage-wise prun- 
ing procedure is adopted, which is based on carefully designed model 
partition strategies. Experiment results show redundancy-aware train- 
ing can compress LeNet-5, ResNet-56 and AlexNet by a factor of 43.8x, 
7.9x and 6.4x, respectively. Compared to state-of-the-art approaches, 
our method achieves similar or higher sparsity while consuming signifi- 
cantly less time, e.g., 2.3x-18x more efficient in terms of time. 


Keywords: In-training pruning · Model compression ·
Convolutional neural networks · Deep learning


1 Introduction 


In recent years, convolutional neural networks (CNNs) have been playing an 
important role in the remarkable improvements achieved in a wide range of 
challenging computer vision tasks such as large-scale image classification [11], 
object detection [3], and segmentation [6]. Deploying CNN models in real-world 
applications has attracted increasing interest.

However, the state-of-the-art accuracy delivered by these CNNs comes at 
the cost of significant storage and computational overheads. For instance, 
AlexNet [11] has 61 million parameters, takes up more than 243 MB of storage 
and requires 1.4 billion floating point operations to classify a 224 × 224 image.
© Springer Nature Switzerland AG 2018 
As a result, deploying CNNs on devices with limited resources, such as mobile
phones and wearable devices, could be infeasible. 

Since large CNNs are highly over-parameterized [2], many methods have been 
proposed to compress them. Pruning methods have attracted much attention 
due to their simplicity and effectiveness. However, most of these methods work as
post-training processes. Starting from dense pre-trained models, unimportant
connections and neurons are pruned to reduce the model size and the computational
complexity. A subsequent fine-tuning step compensates for the
accuracy loss. The pruning and fine-tuning steps may be repeated several times
for a good balance between accuracy and sparsity (the ratio of pruned weights). 
Some methods introduce sparsity-inducing regularizers to learn sparse structures 
from a pre-trained dense model. The overall process consumes significant time 
to get sparse models, resulting in poor time efficiency as summarized in Table 1. 

In this paper, we propose redundancy-aware training, which can exploit 
redundancy efficiently by learning both sparse neural network structures and 
weight values from scratch. Besides minimizing training loss, it prunes unimportant
connections to obtain sparse structures. A structure that changes during
training can make it difficult to reach good accuracy. Redundancy-aware training
solves this problem by adopting a stage-wise pruning procedure. It leverages novel
partition strategies to divide the network into layer classes. The pruning starts
from one class in the first stage and extends to the remaining classes in the
following stages. Our training method yields sparse and accurate models when it finishes. Evaluations
on several datasets, including MNIST, CIFAR10 and ImageNet, demonstrate 
our redundancy-aware training can achieve state-of-the-art compression results. 
Meanwhile, our method is much more efficient in terms of time as it requires 
neither extending normal training iterations nor any post-training compression 
procedure. 


Table 1. Time breakdown of some pruning methods, in epochs. For post-training
methods, we show the epochs spent in the training phase (Training) and in the
post-training phase (Post-training). For in-training pruning methods (denoted
by *), we report the epochs taken by the method (Training) and the normal
training epochs (Normal).

Method    | CNN       | Dataset  | Training | Post-training | Normal
DC [5]    | AlexNet   | ImageNet | 90       | >960          | -
DNS [4]   | LeNet-5   | MNIST    | 11       | 17            | -
NISP [18] | GoogLeNet | ImageNet | 60       | 60            | -
LSN* [14] | LeNet-5   | MNIST    | 200      | -             | 11
NSN* [10] | ResNet-56 | CIFAR10  | 205      | -             | 164
2 Related Work


According to whether pre-trained models are required, we divide existing pruning 
methods into two categories: post-training methods and in-training methods. 


Post-training Pruning. Deep compression [5] prunes trained CNNs through 
a magnitude-based weight pruning method, showing a significant reduction 
in model size. DNS [4] improves deep compression [5] by allowing the recov- 
ery of pruned weights. NISP [18] prunes unimportant neurons based on its 
neuron importance estimation. SSL [17] makes use of group lasso regulariza- 
tion to remove groups of weights, e.g., channels, filters, and layers, in CNNs. 
Compression-aware training [1] takes post-training compression into account in 
the training phase. A regularizer is added to encourage the weights to have lower 
rank. These methods often suffer from poor time efficiency. Table 1 lists time 
taken by some pruning methods. We can see the post-training compression pro- 
cedure takes considerable time. Redundancy-aware training adopts in-training 
pruning, thus improving the time efficiency significantly. 


In-training Pruning. AL [15] introduces binary parameters to prune neu- 
rons and layers. A binarizing regularizer is used to attract them to 0 or 1. 
A similar approach to [15] is adopted to prune weights in [16]. The above two
methods only evaluate the in-training compression ability on small datasets.
A method that uses L0 regularization to directly learn sparse structures is
proposed in [14]. To enable gradient-based optimization, an approximation of the
non-differentiable L0 norm is added to the loss, but more training iterations are
required (see Table 1). Redundancy-aware training adopts a pruning approach to
remove redundant weights. By incorporating stage-wise pruning into the training
process, our method outperforms other in-training pruning works in terms of
both compression results and time efficiency.


Fig. 1. Pruning (b) with u = 0.2 and l = 0.1. Weights marked red are pruned. The
pruning states of the last iteration and this iteration are shown in (a) and (c),
respectively. (Color figure online)
Algorithm 1. Redundancy-Aware Training
Input: network: the CNN to train
       max_iterations: the maximum number of training iterations
       extending_interval: the interval of extending pruning to the next class
Output: network trained by redundancy-aware training

divide network into layer classes based on the partition strategies:
    classes ← {C1, C2, ..., Cm}
i ← 0
pruning_classes ← {}
initialize network
while i < max_iterations do
    if mod(i, extending_interval) = 0 then
        c ← classes.pop()
        append c to pruning_classes
    end if
    forward and backward through network
    update weights in network
    for each class c in pruning_classes do
        for each layer l in c do
            prune layer l
        end for
    end for
    i ← i + 1
end while


3 Redundancy-Aware Training 


In this section, we introduce our redundancy-aware training method. The 
overview of the proposed method is displayed in Algorithm 1. For a given CNN, 
redundancy-aware training first divides it into layer classes based on the partition 
strategies. In each training iteration, it prunes layers in pruning_classes after
the update of weights. More classes are appended to pruning_classes as
training proceeds. We first introduce how to prune unimportant weights during
training. Then, we present the model partition strategies. 
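The stage-wise schedule in Algorithm 1 can be sketched in a few lines of Python. This is only an illustration of how the pruning scope grows over time, not the authors' Caffe implementation: the function name `pruning_schedule` is hypothetical, the classes are plain strings, and two details are assumptions of this sketch: classes are taken in order starting from the first one, and no new class is added once all classes are already under pruning.

```python
def pruning_schedule(classes, max_iterations, extending_interval):
    """Yield (iteration, classes under pruning), following the loop of Algorithm 1."""
    remaining = list(classes)   # classes not yet under pruning
    pruning_classes = []        # classes currently being pruned
    for i in range(max_iterations):
        # At the start of every stage, extend the pruning scope by one class.
        # Assumption: stop extending once every class has been added.
        if i % extending_interval == 0 and remaining:
            pruning_classes.append(remaining.pop(0))
        # A real training loop would run forward/backward, update the weights,
        # and then prune every layer of every class in pruning_classes here.
        yield i, list(pruning_classes)


stages = dict(pruning_schedule(["C1", "C2", "C3"],
                               max_iterations=6, extending_interval=2))
for iteration, active in stages.items():
    print(iteration, active)
```

With three classes and an interval of two iterations, only C1 is pruned during iterations 0 and 1, C2 joins at iteration 2, and all three classes are pruned from iteration 4 onward.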


3.1 Pruning Weights During Training 


As the pruning works on each layer independently, we take pruning one layer as 
an example to illustrate the in-training pruning. 

Let us denote the parameters of a layer by K. Redundancy-aware training 
adopts a magnitude-based pruning approach. Specifically, two thresholds u and l
are introduced. In each iteration, weights with absolute value below l are pruned, 
while others with magnitude above u are kept. Weights with absolute value in 
the range of [l,u] are skipped in this iteration and their pruning states stay 
unchanged. To reduce the risk of pruning important weights wrongly, we use 
the update scheme in [4] where pruned weights can also be updated in the 
back-propagation. This scheme enables the recovery of wrongly pruned weights.
Figure 1 shows an example. 
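The two-threshold rule can be illustrated with a small Python sketch. It assumes a per-weight binary mask (1 = kept, 0 = pruned) in the spirit of the update scheme of [4]; the function name `update_mask` and the weight values are made up for illustration, while u = 0.2 and l = 0.1 are the values used in Fig. 1.

```python
def update_mask(weights, mask, u, l):
    """Return the new pruning mask (1 = kept, 0 = pruned) for one layer."""
    new_mask = []
    for w, m in zip(weights, mask):
        magnitude = abs(w)
        if magnitude < l:
            new_mask.append(0)   # below l: prune
        elif magnitude > u:
            new_mask.append(1)   # above u: keep, or recover if previously pruned
        else:
            new_mask.append(m)   # within [l, u]: pruning state stays unchanged
    return new_mask


# Hypothetical dense weights; they keep being updated even while masked out.
weights = [0.05, 0.15, 0.30, -0.25, 0.12]
mask = [1, 0, 0, 1, 1]          # pruning state after the previous iteration

new_mask = update_mask(weights, mask, u=0.2, l=0.1)
effective = [w * m for w, m in zip(weights, new_mask)]
print(new_mask)    # [0, 0, 1, 1, 1]
print(effective)
```

The third weight was pruned in the previous iteration but has grown above u, so it is recovered; the second weight lies between the thresholds and therefore stays pruned.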

To avoid tuning u and l for each layer manually, we choose to compute them
based on K as shown in Eq. 1. μ and σ represent the mean and the standard
deviation of K, respectively. Two hyper-parameters, range and ε, are introduced to
provide more flexibility. Increasing range makes l larger, resulting in pruning
more weights from the network. ε is a small positive value and controls the difference
between u and l. We analyze how range and ε influence the compression results in
Sect. 4.2.


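\[
\begin{aligned}
u &= \max\bigl(\mu + \sigma\,(\mathit{range} + \epsilon),\; 0\bigr) \\
l &= \max\bigl(\mu + \sigma\,(\mathit{range} - \epsilon),\; 0\bigr)
\end{aligned}
\tag{1}
\]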
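As a rough illustration, Eq. 1 can be evaluated for one layer as follows. The text does not say whether the statistics are taken over raw or absolute weight values, nor which standard deviation estimator is used; this sketch takes the values exactly as passed in and uses the population standard deviation, both of which are assumptions. The name `thresholds` and the sample weights are hypothetical.

```python
import statistics


def thresholds(layer_weights, range_, eps):
    """Compute the pruning thresholds (u, l) of Eq. 1 for one layer."""
    mu = statistics.fmean(layer_weights)
    sigma = statistics.pstdev(layer_weights)   # assumption: population std
    u = max(mu + sigma * (range_ + eps), 0.0)
    l = max(mu + sigma * (range_ - eps), 0.0)
    return u, l


K = [-0.4, -0.1, 0.0, 0.1, 0.4]               # made-up layer weights
u, l = thresholds(K, range_=1.8, eps=0.1)
print(round(u, 4), round(l, 4))

# A larger range raises l, so more weights fall below it and get pruned.
_, l_larger = thresholds(K, range_=3.0, eps=0.1)
print(l_larger > l)
```

Because ε only enters as range + ε and range - ε, the gap u - l equals 2σε whenever neither value is clipped at zero, which is why ε controls the width of the "unchanged" band.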


Fig. 2. (a) Layer sparsity and (b) sensitivity of layers in ResNet-56 (horizontal
axis: layer). The shapes of the sparsity curves at different training times are
quite similar, indicating that the difference in sparsity between layers stays
stable during training. Based on the sensitivity, ResNet-56 is divided into
three classes, as shown by the black vertical lines in (b).


3.2 Model Partition 


In-training pruning allows learning sparse structures during the training phase. 
However, pruning all layers in network simultaneously causes instability and 
slows down the learning process, resulting in difficulty in reaching as good accu- 
racy as the normal training. 

Redundancy-aware training adopts a stage-wise pruning procedure. The 
pruning scope in each stage is orchestrated by our model partition strategies. 
When layers within the pruning scope are being pruned, the remaining layers can adapt
to the pruning and alleviate its impact by updating their weight values. Formally,
we call the unit of adjusting the pruning scope “class”. A class contains several 
consecutive layers. Based on our model partition strategies, redundancy-aware 
training divides the CNN into classes. Then, the in-training pruning starts from 
the first class and extends to one more class at the beginning of each of the 
following stages. Both layer-by-layer pruning and pruning all layers together are
special cases of our approach. 
Partition Strategy. We propose two heuristic strategies for two different types
of CNN. The first type is called simple CNN, which refers to networks composed 
of stacked convolution layers and several fully-connected layers. LeNet-5 [12] and 
AlexNet [11] fall into this category. For simple CNN, the partition strategy is: 

Strategy1: Layers with the same type are divided into the same class. 

Thus, simple CNNs will be divided into two classes. The first class contains the
convolution layers, and the fully-connected layers belong to the second class.
Strategy1 is not applicable to recently designed CNNs, which tend to avoid using
fully-connected layers. For example, ResNet [7] has only one fully-connected
layer to produce the probabilities over the given number of classes. Inspired by [13]
which prunes filters based on the analysis of layer sensitivity to pruning, we 
propose the second strategy for these CNNs: 


Algorithm 2. Partition Strategy2

Input: δ: sensitivity difference threshold
       s[...]: layers' sensitivity to pruning
       layers[...]: layers in the given network
Output: the partition result of network

c ← {layers[1]}
s_avg ← s[1]
for l ← 2 to layers.size do
    diff ← abs(s[l] − s_avg)
    if diff > δ then
        set c to a new partition class
    end if
    add layers[l] to c
    update s_avg to the average sensitivity of layers in c
end for


Strategy2: Divide model at layers which are quite sensitive to pruning. 

Algorithm 2 illustrates how this strategy works. The sensitivity to pruning is
determined through our proposed ‘probe’ phase which is described in the next
section. We also analyze the impact of δ in Sect. 4.2.
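A minimal Python sketch of Strategy2 follows, under the reading that a new class starts whenever a layer's sensitivity differs from the running average of the current class by more than δ. The function name `partition_layers`, the zero-based layer indices, and the sensitivity values are illustrative assumptions, not data from the paper.

```python
def partition_layers(sensitivity, delta):
    """Split layer indices 0..n-1 into consecutive classes (after Algorithm 2)."""
    classes = [[0]]                 # the first class starts with the first layer
    s_avg = sensitivity[0]
    for idx in range(1, len(sensitivity)):
        if abs(sensitivity[idx] - s_avg) > delta:
            classes.append([])      # open a new partition class at this layer
        classes[-1].append(idx)
        members = classes[-1]
        # Running average of the sensitivities in the current class.
        s_avg = sum(sensitivity[j] for j in members) / len(members)
    return classes


# Hypothetical per-layer sensitivities with two clear jumps.
s = [1.0, 1.1, 0.9, 2.0, 2.1, 1.9, 3.2, 3.0]
print(partition_layers(s, delta=0.5))           # [[0, 1, 2], [3, 4, 5], [6, 7]]
print(partition_layers(s, delta=float("inf")))  # a single class with all layers
```

An infinite threshold never opens a new class, which corresponds to pruning all layers together from the start of training.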


Determine Layer’s Sensitivity Efficiently. The in-training pruning zeros
out unimportant weights. Layers with relatively low sparsity should be important
and sensitive to pruning. Thus, we define a layer’s sensitivity as the reciprocal
of its sparsity achieved by the in-training pruning. A naive but inefficient
approach to determine the sensitivity works as follows: we train the CNN with all
layers under in-training pruning and use each layer’s sparsity after training to
compute the sensitivity. Based on a key observation, we propose a more efficient
approach. Figure 2a shows the sparsity of ResNet-56 at different times during training.
The relative sparsity between layers is actually quite stable during training. As the
partition result only depends on the difference of sparsity between layers, we can
use the sparsity at an early stage of training to obtain the partition result.
More precisely, we introduce a probe phase where the CNN is trained with all
layers under the in-training pruning. When the probe phase finishes, we compute
each layer’s sensitivity based on its sparsity, which is then used by Strategy2. In our
experiments, we find that a tenth of the training time is sufficient for the probe phase.
Figure 2b shows the sensitivity of layers in ResNet-56. Notably, layers
of residual blocks where the number of output channels changes are sensitive to
pruning. This observation is consistent with the results reported in [13].
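The sensitivity definition (the reciprocal of the sparsity reached during the probe phase) is simple enough to write down directly. In the sketch below, the helper names and the toy weight lists are hypothetical, and returning infinity for a layer with no pruned weights is an assumption of this sketch, since the text does not cover that case.

```python
def layer_sparsity(weights):
    """Fraction of weights in a layer that are exactly zero."""
    zeros = sum(1 for w in weights if w == 0.0)
    return zeros / len(weights)


def layer_sensitivity(weights):
    """Sensitivity = 1 / sparsity; lower sparsity means higher sensitivity."""
    sparsity = layer_sparsity(weights)
    if sparsity == 0.0:
        return float("inf")   # assumption: an unpruned layer is maximally sensitive
    return 1.0 / sparsity


# Toy layers after a probe phase: one heavily pruned, one barely pruned.
robust_layer = [0.0] * 8 + [0.3, -0.2]
sensitive_layer = [0.0] * 2 + [0.4, -0.1, 0.2, 0.5, -0.3, 0.6, 0.1, -0.7]

print(layer_sparsity(robust_layer), layer_sensitivity(robust_layer))
print(layer_sparsity(sensitive_layer), layer_sensitivity(sensitive_layer))
```

The barely pruned layer ends up with the larger sensitivity value, which is the kind of layer Strategy2 uses as a class boundary.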


Table 2. Comparison to other compression works. Results of our method are denoted
by RA-range-ε. The result of DC for ResNet-56 is provided in [10]. The result of PF is
based on our implementation. The scratch-train models show notable accuracy drops,
demonstrating the difficulty of training a sparse network from scratch.

Network   | Type          | Method        | Baseline accuracy | Accuracy change | Sparsity
LeNet-5   | In-training   | LNA [15]      | 99.3%             | -0.23%          | 90.5%
LeNet-5   | In-training   | LSN [14]      | 99.1%             | 0               | 90.7%
LeNet-5   | In-training   | TSNN [16]     | 99.2%             | -0.01%          | 95.8%
LeNet-5   | In-training   | RA-2-0.1      | 99.1%             | 0               | 97.7%
LeNet-5   | Post-training | SSL [17]      | 99.1%             | -0.1%           | 75.1%
LeNet-5   | Post-training | DC [5]        | 99.2%             | +0.03%          | 92%
LeNet-5   | Post-training | DNS [4]       | 99.1%             | 0               | 99.91%
LeNet-5   | Post-training | Scratch-train | 99.1%             | -1.5%           | 97.7%
ResNet-56 | In-training   | NCP [10]      | 93.4%             | -0.5%           | 50%
ResNet-56 | In-training   | NWP [10]      | 93.4%             | -0.6%           | 66.7%
ResNet-56 | In-training   | RA-1.8-0.1    | 92.4%             | -0.1%           | 87.4%
ResNet-56 | In-training   | RA-3.0-0.1    | 92.4%             | -1.0%           | 92.1%
ResNet-56 | Post-training | CP [8]        | 92.8%             | -1.0%           | 50%
ResNet-56 | Post-training | PF [13]       | 92.4%             | -1.04%          | 62%
ResNet-56 | Post-training | DC [5]        | 93.4%             | -0.8%           | 66.7%
ResNet-56 | Post-training | Scratch-train | 92.4%             | -2.8%           | 87.4%


4 Evaluation 


In this section, we evaluate redundancy-aware training on MNIST, CIFAR10, 
and ImageNet with LeNet-5, ResNet-56, and AlexNet, respectively. First, we 
compare the compression result and the time efficiency with state-of-the-art 
compression methods. The compression result includes achieved sparsity and 
accuracy loss. Sparsity is defined as the percentage of the zeroed out weights. We 
compare the time efficiency based on the number of iterations or epochs required 
to obtain sparse models. Then, we analyze the effectiveness of the model partition 
and the effect of hyper-parameters in Sect. 4.2. We implement our method in 
Caffe [9]. 


4.1 Compression Result and Time Efficiency 


The comparison to other methods on LeNet-5 and ResNet-56 is summarized 
in Table 2. We also train models with the same sparsity as the models trained 
through redundancy-aware training from scratch (the scratch-train). 
LeNet-5. Redundancy-aware training reduces the model size of LeNet-5 by
43.8x without accuracy loss and outperforms all in-training methods by a
notable margin, validating its ability to reduce redundancy in the training phase.
Compared to post-training methods, redundancy-aware training achieves higher
or similar sparsity. Our method prunes more weights in every layer than [14]
and [5]. As for time efficiency, our method takes only 11 epochs, which equals
the normal training time; it is about 18x more efficient than the in-training
method in [14] and 2.5x more efficient than the method in [4].
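The quoted speedups appear to follow from the epoch counts in Table 1, taking the total epochs of each baseline (training plus any post-training phase) and dividing by the 11 epochs of redundancy-aware training. That reading is an assumption of this sketch; the helper name `speedup` is hypothetical.

```python
def speedup(baseline_epochs, ours_epochs):
    """Ratio of epochs needed by a baseline to epochs needed by our method."""
    return baseline_epochs / ours_epochs


ours = 11                          # epochs of redundancy-aware training on LeNet-5

lsn = speedup(200, ours)           # LSN [14]: 200 training epochs (Table 1)
dns = speedup(11 + 17, ours)       # DNS [4]: 11 training + 17 post-training epochs

print(round(lsn, 1))   # 18.2
print(round(dns, 2))   # 2.55
```

Both values round to the "about 18x" and "2.5x" figures stated in the text.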


ResNet-56. Based on Strategy2 in Sect. 3.2, ResNet-56 is divided into
three classes. We extend the in-training pruning at 10k and 20k iterations. 
Redundancy-aware training achieves a 7.9x reduction with only 0.1% top-1 
accuracy drop. Importantly, our method achieves this without any post-training 
procedures. By using a larger range, we can achieve a 12.6x compression at the 
cost of 1.13% accuracy loss, which can be reduced to 1% after a fine-tuning of 20k 
iterations. As far as we know, our method achieves state-of-the-art compression 
result for ResNet-56. In terms of time-efficiency, our method takes 70k iterations 
(64k for training and 6k for the probe phase), which is about 2.3x more efficient 
than NWP in [10] and PF in [13]. 
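Sparsity and compression factor are two views of the same quantity: if a fraction s of the weights is zeroed out, only 1 - s of them remain. The sketch below assumes the compression factor counts remaining weights only and ignores any storage overhead for sparse indices; the function name is hypothetical, and the sparsity values are the rounded figures from Table 2.

```python
def compression_factor(sparsity):
    """Compression factor implied by a sparsity level, counting weights only."""
    return 1.0 / (1.0 - sparsity)


ra_1_8 = compression_factor(0.874)   # RA-1.8-0.1 on ResNet-56
ra_3_0 = compression_factor(0.921)   # RA-3.0-0.1 on ResNet-56

print(round(ra_1_8, 2))   # 7.94
print(round(ra_3_0, 2))   # 12.66
```

These agree with the reported 7.9x and 12.6x reductions up to the rounding of the sparsity values in the table.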


Table 3. Layer-by-layer comparison to deep compression on AlexNet.

Method/layer | conv1 | conv2 | conv3 | conv4 | conv5 | fc1 | fc2 | fc3 | Total
DC           | 16%   | 62%   | 65%   | 63%   | 63%   | 91% | 91% | 75% | 89%
Ours         | 31%   | 65%   | 69%   | 63%   | 61%   | 88% | 81% | 80% | 84%

AlexNet. Finally, we experiment with AlexNet on ImageNet. We train the 
bvlc_alexnet in Caffe and get 78.65% top-5 accuracy on validation dataset with 
single-view testing. Redundancy-aware training reduces the model size by 6.4x 
with 0.36% accuracy loss. We further fine-tune it for 45k iterations and obtain a 
model with 78.54% accuracy. We display sparsity achieved by our method and 
DC [5] in Table 3. Our method takes 99 epochs in total, which is 9.69x more
efficient in terms of time. 


4.2 Ablation Study 


Hyper-parameter Sensitivity. We make use of ResNet-56 to measure the
impact of varying range and ε. The result is shown in Fig. 3.

Increasing range leads to a larger l, so more weights are pruned in training.
Thus we can trade off sparsity against accuracy by
adjusting range. Note that the accuracy does not drop dramatically (2.2% drop)
when range increases from 0 to 3.5. Since increasing ε makes l smaller, weights
are less likely to be pruned and the sparsity decreases. We can observe that the
accuracy does not change drastically for a wide range of ε.

Fig. 3. Impact of hyper-parameters range and ε. The model is divided into three
partition classes.


Table 4. Accuracy with varying δ.

δ                   | +∞    | 0.5*s_avg | 0.4*s_avg | 0.3*s_avg
# partition classes | 1     | 2         | 3         | 5
Accuracy            | 91.3% | 91.7%     | 92.3%     | 90.2%
Fig. 4. Effect of partition with varying ranges.


Effectiveness of Partition Strategies. We first analyze the impact on accuracy
of the number of partition classes. To this end, we fix range = 1.8
and ε = 0.1 and vary δ to change the partition result. Results are shown in
Table 4. When δ is set to +∞, all layers belong to the same class and the network
is pruned throughout the training phase, which results in a 1.1% accuracy drop.
Dividing ResNet-56 into two or three classes improves accuracy. The model with
five classes has inferior accuracy, implying that too many classes result in
insufficient training iterations in each stage.

We also verify the effectiveness of model partition with varying ranges. 
Results are shown in Fig. 4. The model partition helps to improve accuracy over 
a wide scope of ranges, confirming the benefit of our model partition approach 
in stabilizing training and helping in good convergence. 
5 Conclusion


In this paper, we propose an in-training compression method, redundancy-aware 
training. Our method can learn both sparse connections and weight values from 
scratch. We highlight that redundancy-aware training achieves state-of-the-art
compression results without any post-training compression procedures and con- 
sumes significantly less time when compared to other methods. 
Abstract. Multimodal matching aims to establish relationships across
different modalities such as image and text. Existing works mainly focus
on maximizing the correlation between feature vectors extracted from
off-the-shelf models. The feature extraction and the matching thus form a
two-stage learning process. This paper presents a novel two-stream convolutional
neural network that integrates the feature extraction and the
matching in an end-to-end manner. Visual and textual streams are
designed for feature extraction and are then concatenated with multiple
shared layers for multimodal matching. The network is trained using an
extreme multiclass classification loss by viewing each multimodal data item
as a class. Then a fine-tuning step is performed with a ranking constraint.
Experimental results on the Flickr30k dataset demonstrate the effectiveness
of the proposed network for multimodal matching.


Keywords: Multimodal matching · Two-stream network · Convolutional neural network


1 Introduction 


Multimodal analysis has received ever-increasing research focus due to the explo- 
sive growth of multimodal data such as image, text, video and audio. A core 
problem for multimodal analysis is to mine the internal correlation across differ- 
ent modalities. In this paper, we focus on the image-text matching. For example, 
given a query image, our aim is to retrieve the relevant texts in the database 
that best illustrate the image. There are two major challenges in multimodal 
matching: (1) effectively extracting the feature from the multimodal data; (2) 
inherently correlating the feature across different modalities. 

Previous works on multimodal matching preferred to adopt off-the-shelf
models to extract features rather than learn modality-specific features. For
the image, well-known hand-crafted feature extraction techniques such as SIFT
[1] and GIST [2] were widely used. Inspired by recent breakthroughs of
convolutional neural networks (CNNs) in visual recognition, CNN visual features
were also introduced to multimodal matching [14]. For the text, latent Dirichlet
allocation (LDA) [3] and word2vec [18] models were two typical choices for
vectorization. Despite their contributions to multimodal matching, off-the-shelf
models suffer from some weaknesses. They are not specifically designed for the
task of multimodal matching. As a result, their features are not discriminative
enough, which limits the final matching performance.


[Figure 1 diagram: an image stream (Image → Image Feature) and a text stream
(example sentence "A female tennis player in a white shirt and black tennis
skirt getting ready to swing." → Text Feature), followed by a ranking loss and
a softmax classifier.]

Fig. 1. Overview of the proposed two-stream convolutional neural network. 


Another challenge is to correlate these multimodal features. Most deep
learning based methods [4,5] are highly dependent on categorical information
for network training. However, such high-level semantic information is absent
in most scenarios and requires extensive manual labeling. Furthermore, the
explosive increase of data makes it unrealistic to label each item with a
certain category. Fortunately, co-occurring data usually deliver correlated
information (i.e. image-text pair information). The pair information is
relatively easy to obtain via web crawlers and should be fully exploited for
multimodal matching.

To address the above issues, we propose a novel two-stream convolutional neural
network, shown in Fig. 1, which extracts visual and textual representations and
simultaneously performs the task of multimodal matching. Thus the similarity
between images and texts can be measured directly according to the learned
representations. More specifically, a CNN is the backbone that extracts the
features from the raw images and texts, respectively. The outputs of the two
streams are concatenated and followed by several shared fully connected layers.
The final output of the network is the class probabilities after a softmax
regression. To train the network, we adopt an extreme multiclass classification
loss and a ranking loss, both based on the pair information.

The remainder of this paper is structured as follows. Section 2 reviews the 
related work. Section 3 presents our two-stream network for multimodal match- 
ing and its learning process, followed by experimental results in Sect. 4. Section 5 
draws an overall conclusion. 


2 Related Work 


The core issue for multimodal matching is to learn discriminative and joint 
image-text representations. Canonical correlation analysis (CCA) [7] and cross- 


16 Y. Zhang et al. 


modal factor analysis (CFA) [8] were two classic methods. They linearly
projected vectors from the two views into a shared space of maximal correlation.
Andrew et al. proposed deep CCA [12] to learn the nonlinear transformation 
through two deep networks, whose outputs are maximally correlated. Yan et al. 
[13] further introduced DCCA into image-text matching. 

Inspired by recent breakthroughs in visual recognition, CNN was also widely 
employed in multimodal matching. Wei et al. [14] provided a new baseline for 
cross-modal retrieval with CNN visual features instead of traditional SIFT [1] 
and GIST [2] features. CNN has also shown its powerful abilities in natural lan- 
guage processing. Hu et al. [10] proposed a sentence matching model based on 
CNN that represented the sentence and captured the matching relation simul- 
taneously. In [9], convolutional architectures were first employed to learn the 
correlation between image and sentence by encoding their separate representa- 
tions into a joint one. 

There are also some deep models related to our work. In [6], a three-stream
deep convolutional network was proposed to generate a shared representation
across the image, text, and sound modalities. Wang et al. [15] presented a
two-branch network to learn an image-text joint embedding. The network was
trained with an extended ranking constraint and only received feature vectors
as input. Mao et al. [16] proposed a multimodal Recurrent Neural Network
(m-RNN) model for image captioning and cross-modal retrieval. The work in [17]
presented a selective multimodal network that incorporated attention and a
recurrent selection mechanism based on long short-term memory.


3 Two-Stream CNN 


3.1 Network Architecture 


Overall Architecture. As shown in Fig. 1, the overall architecture of the
proposed network contains two parts. The colored part with two streams focuses
on feature extraction from the raw image and text. The gray part integrates
the feature vectors from the different modalities with shared weights and fully
connected layers for further multimodal matching. In general, to generate a
joint representation, the colored part is modality-specific, whereas the gray
part is shared across modalities.


Image Stream. We adopt a 50-layer ResNet model [11] pretrained on the ImageNet
classification task as the visual CNN. We discard the top fully connected layer
designed for ImageNet. Thus, given a raw image resized to 224 × 224, the model
produces a 2048-dim vector after average pooling, which serves as the image
representation.


Text Stream. Since each image can be represented by a fixed-length vector
with a CNN, we also design a textual CNN with three convolutional layers to
vectorize the text, as shown in Fig. 2. Text is first encoded into a
$1 \times n \times d$ numerical matrix $T$, where $n$ is the length of the
sentence and $d$ is the size of the vocabulary. The vocabulary contains all
tokens appearing in the corpus. Let $w_i$ be the $i$-th word in the vocabulary;
$w_i$ can then be converted into a one-hot high-dimensional sparse vector $v_i$
whose $i$-th element is 1 and whose remaining elements are 0. The embedding
layer then turns each $v_i$ into a low-dimensional dense word embedding $e_i$
of length $k$ via a lookup table. Thus, each sentence is encoded into a
$1 \times n \times k$ matrix.


Fig. 2. Overview of the textual CNN stream.

Although the embedding layer encodes the semantic information of each word
into a vector, simply concatenating word vectors ignores many subtleties of a
good representation, e.g. word order. Therefore, the subsequent convolutional
layers are employed to extract word-sequence information. In each
convolutional layer, the context in the sentence is modeled using two
convolution kernels of size 1 × 2 and 1 × 3, respectively. The outputs of the
two convolutions are concatenated directly and fed into the following layers.
At the end of the network, a pooling layer with dropout produces the final
output, which matches the size of the image features. Convolutional layers
combined with word embeddings ensure that the output feature contains most of
the information needed to represent sentences effectively for multimodal
matching.
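
The text stream described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: it uses a single convolutional layer instead of three, random weights, no dropout, right-side zero padding, and max pooling over positions, all of which are assumptions made here for illustration. The names `embed`, `conv1d`, `text_stream`, and `random_matrix` are hypothetical.

```python
import random


def embed(token_ids, table):
    """Lookup-table embedding: same result as one-hot vector times the table."""
    return [table[t] for t in token_ids]


def conv1d(seq, filters, width):
    """Apply each filter to every window of `width` consecutive word vectors.

    The sequence is zero-padded on the right so the output keeps length n
    (this padding scheme is an assumption of the sketch).
    """
    k = len(seq[0])
    padded = seq + [[0.0] * k for _ in range(width - 1)]
    out = []
    for pos in range(len(seq)):
        window = [x for vec in padded[pos:pos + width] for x in vec]
        out.append([sum(w * x for w, x in zip(f, window)) for f in filters])
    return out


def text_stream(token_ids, table, filters2, filters3):
    """Embed, convolve with widths 2 and 3, concatenate channels, max-pool."""
    seq = embed(token_ids, table)
    conv2 = conv1d(seq, filters2, 2)
    conv3 = conv1d(seq, filters3, 3)
    merged = [a + b for a, b in zip(conv2, conv3)]   # channel-wise concatenation
    return [max(channel) for channel in zip(*merged)]  # pool over positions


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, k, channels = 12, 4, 3   # toy sizes; the paper uses d = 20,074 and k = 300
table = random_matrix(d, k, rng)
filters2 = random_matrix(channels, 2 * k, rng)
filters3 = random_matrix(channels, 3 * k, rng)
features = text_stream([1, 5, 2, 7, 0], table, filters2, filters3)
```

The output has one value per filter (here 2 × 3 = 6), independent of the sentence length, which is what allows the text feature to match a fixed image feature size.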


3.2 Network Learning 


Objective Function. Supervised semantic labels usually play an important
role in deep neural network learning. However, the lack of labels poses a unique
challenge to multimodal matching: how to effectively utilize image-text pair
information, which is the only supervision available. In this paper, we
transform multimodal matching into an extreme multiclass classification task,
where matching becomes accurately classifying a specific instance among tens of
thousands of classes. Here, each multimodal document, consisting of an image
and its corresponding text, is viewed as a pseudo class. Given an instance
$x^{a}$, we apply the softmax function to the output of the network
$z \in \mathbb{R}^{1 \times n}$ ($n$ is the number of multimodal documents).
Thus, we can obtain the posterior probability of the instance being classified
into the right category $c$. It can be formally written as Eq. (1).

$$P(c \mid x^{a}) = \mathrm{softmax}(z) = \frac{e^{z_c}}{\sum_{i=1}^{n} e^{z_i}} \tag{1}$$

Then we minimize the negative log-likelihood of $P(c \mid x^{a})$, defined as Eq. (2).

$$L_{cls} = -\log\big(P(c \mid x^{a})\big). \tag{2}$$
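
As a small numerical illustration of Eqs. (1) and (2), the following sketch computes the softmax posterior and the negative log-likelihood for a toy output vector. The function names `softmax` and `instance_loss` are hypothetical, and subtracting the maximum logit is a standard numerical-stability step that the text does not mention.

```python
import math


def softmax(z):
    """Eq. (1): turn network outputs z into class probabilities."""
    m = max(z)                              # subtract max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def instance_loss(z, c):
    """Eq. (2): negative log-likelihood of the correct pseudo class c."""
    return -math.log(softmax(z)[c])


# Toy example with n = 4 pseudo classes (one per multimodal document).
uniform_loss = instance_loss([0.0, 0.0, 0.0, 0.0], 2)
confident_loss = instance_loss([0.0, 0.0, 5.0, 0.0], 2)
```

With all outputs equal, the loss is log n; it decreases as the output for the correct pseudo class grows relative to the others.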


To obtain more discriminative representations, we also perform metric
learning based on a ranking constraint. In the feature space, the distances from
the anchor $x^{a}$ to $x^{p}$ and to $x^{n}$ should be separated by a margin
$\alpha$ ($\alpha = 0.1$ in our case), i.e. $d(x^{a}, x^{p}) + \alpha < d(x^{a}, x^{n})$.
Instances sharing the same pseudo class with $x^{a}$ are defined as $x^{p}$;
all others are defined as $x^{n}$. We compute the cosine distance between the
feature vectors $(v_i, v_j)$ of two instances $(x_i, x_j)$ as
$d(x_i, x_j) = 1 - \frac{v_i^{\top} v_j}{\|v_i\| \, \|v_j\|}$. We further define the
bi-directional ranking constraint with a hinge loss for the given image
reference $(x^{a}_{img}, x^{p}_{txt}, x^{n}_{txt})$ and the text reference
$(x^{a}_{txt}, x^{p}_{img}, x^{n}_{img})$, respectively, as Eq. (3).

$$L_{rank} = \max\{0,\; d(x^{a}_{img}, x^{p}_{txt}) - d(x^{a}_{img}, x^{n}_{txt}) + \alpha\} + \max\{0,\; d(x^{a}_{txt}, x^{p}_{img}) - d(x^{a}_{txt}, x^{n}_{img}) + \alpha\} \tag{3}$$
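
The bi-directional hinge loss of Eq. (3) can be sketched in a few lines, assuming feature vectors are plain lists of floats. The helper names (`cosine_distance`, `hinge`, `bidirectional_rank_loss`) are illustrative and not taken from the paper; how triplets are sampled is not covered here.

```python
import math


def cosine_distance(u, v):
    """d = 1 - cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def hinge(anchor, positive, negative, margin=0.1):
    """Zero once the negative is farther than the positive by at least the margin."""
    return max(0.0, cosine_distance(anchor, positive)
               - cosine_distance(anchor, negative) + margin)


def bidirectional_rank_loss(img_a, txt_p, txt_n, txt_a, img_p, img_n, margin=0.1):
    """Eq. (3): image-anchored term plus text-anchored term."""
    return hinge(img_a, txt_p, txt_n, margin) + hinge(txt_a, img_p, img_n, margin)


# Satisfied constraint: positive coincides with the anchor, negative is orthogonal.
satisfied = hinge([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# Violated constraint: the roles of positive and negative are swapped.
violated = hinge([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

A triplet that already satisfies the margin contributes nothing to the loss, so only violating triplets produce gradients.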


The final objective function is a weighted combination of the classification
loss and the ranking loss, as in Eq. (4).

$$L = \lambda_1 L_{cls} + \lambda_2 L_{rank}. \tag{4}$$


Training Scheme. Network training is done in three steps. First, we fix the
image stream and train the remaining part using the classification loss
($\lambda_2 = 0$, only text data is used). The reason is that weights pretrained
on ImageNet can be used for the image stream, whereas the weights of the
remaining part have to be learned from scratch. Second, we update the weights
of the entire network after step 1 converges ($\lambda_2 = 0$, both text and
image data are used). Since the ranking loss usually converges very slowly, or
does not converge at all, especially in two-stream network learning, we
fine-tune the entire network using the combination of the classification loss
and the ranking loss ($\lambda_1 = 1$, $\lambda_2 = 1$) only in the last step.


4 Experiment 


4.1 Datasets and Evaluation Metrics 


We choose the widely used Flickr30k [19] dataset for experiments. Flickr30k
contains 31,783 images collected from the Flickr website. Each image is
described with five sentences. We follow the partition scheme in [16,17], where
29,783, 1,000, and 1,000 images are used for training, validation, and testing,
respectively. R@k and Med r are adopted as evaluation metrics. R@k is the
average recall rate over all queries in the test set. Specifically, given a
query, the recall rate is 1 if at least one ground truth occurs in the top-k
returned results and 0 otherwise. Med r is the median rank of the closest
ground truth in the ranking list.
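
The two metrics can be computed as in the sketch below, assuming each query comes with a ranked list of candidate ids and a set of ground-truth ids. The function names are hypothetical, and assigning a rank just past the end of the list to a query with no hit is an assumption of this sketch.

```python
from statistics import median


def first_hit_rank(ranked_ids, relevant_ids):
    """1-based rank of the closest (highest-ranked) ground-truth item."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return rank
    return len(ranked_ids) + 1   # assumption: a miss ranks just past the list


def recall_at_k(all_ranked, all_relevant, k):
    """R@k: fraction of queries with at least one ground truth in the top k."""
    hits = [first_hit_rank(r, g) <= k for r, g in zip(all_ranked, all_relevant)]
    return sum(hits) / len(hits)


def median_rank(all_ranked, all_relevant):
    """Med r: median over queries of the rank of the closest ground truth."""
    return median(first_hit_rank(r, g) for r, g in zip(all_ranked, all_relevant))


# Three toy queries whose single ground truth (id 1) sits at ranks 2, 3 and 1.
ranked = [[3, 1, 2], [2, 3, 1], [1, 2, 3]]
relevant = [{1}, {1}, {1}]
```

For Img2Txt on Flickr30k each image query has five ground-truth sentences, so `relevant_ids` would hold five ids per query; for Txt2Img it would hold one.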


4.2 Implementation Details 


For Flickr30k, the vocabulary size d is 20,074, and each word is encoded into a
300-dim dense vector. To ensure that each input sentence has the same length
of 32, we pad short sentences with zero vectors. We use the pre-trained vectors
of the word2vec [18] model to initialize our embedding layer. The network is
optimized by backpropagation and mini-batch stochastic gradient descent with
the momentum fixed to 0.9. For the three training steps, the learning rate is
set to 0.001, 0.0001, and 0.00005, respectively, and the maximum number of
epochs is set to 180, 60, and 20 accordingly. In our experiments, we observe
convergence within 150, 30, and 10 epochs.
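
The three-step schedule and its hyperparameters can be summarized as a small configuration sketch. The `Stage` class and field names are hypothetical. The text only states that the ranking weight is zero in steps 1 and 2, so the classification weight of 1 for those steps is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    train_image_stream: bool   # False means the pretrained ResNet-50 is frozen
    lambda_cls: float          # weight of the classification loss
    lambda_rank: float         # weight of the ranking loss
    learning_rate: float
    max_epochs: int


MOMENTUM = 0.9  # SGD momentum, fixed for all steps

SCHEDULE = [
    # lambda_cls = 1.0 in steps 1 and 2 is assumed, not stated in the text.
    Stage("step 1: text stream and shared layers", train_image_stream=False,
          lambda_cls=1.0, lambda_rank=0.0, learning_rate=0.001, max_epochs=180),
    Stage("step 2: entire network", train_image_stream=True,
          lambda_cls=1.0, lambda_rank=0.0, learning_rate=0.0001, max_epochs=60),
    Stage("step 3: fine-tuning with ranking loss", train_image_stream=True,
          lambda_cls=1.0, lambda_rank=1.0, learning_rate=0.00005, max_epochs=20),
]


def total_loss(stage, l_cls, l_rank):
    """Weighted objective: lambda_1 * L_cls + lambda_2 * L_rank."""
    return stage.lambda_cls * l_cls + stage.lambda_rank * l_rank
```

Keeping the ranking weight at zero until the last step reflects the observation that the ranking loss converges slowly when the network is trained from scratch.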


4.3 Experimental Results 


We consider two basic multimodal tasks: Img2Txt (an image query to retrieve 
texts) and Txt2Img (a text query to retrieve images). Table 1 presents the exper- 
imental results of different methods in terms of R@k and Med r. The proposed 
network outperforms the other methods in the Img2Txt task with the highest R@1
of 48.4%. In the Txt2Img task, the R@1 obtained by our method is only 0.7
percentage points lower than that of the best method, RBF-Net [20]. The results
indicate that the learned features are effective for multimodal matching. The
superiority of our network can be explained by the following two aspects:
(1) We simultaneously perform feature extraction and multimodal matching.
Compared with off-the-shelf models, which yield generic representations, the
learned features are more targeted at the matching task; (2) We fully exploit
the image-text pair information via the classification and ranking losses to
generate more discriminative representations.

We also conduct experiments to analyze the effect of the training scheme.
Step 1 only trains the text stream using the classification loss and directly
adopts the image features extracted from the pre-trained ResNet-50. Step 2
trains the entire network using the classification loss, which encourages
instances from the same document to fall into one category. Thus, the results
obtained from step 2 show a large increase of about 10% and 6% on R@1 in the
two retrieval directions, respectively. Step 3 adds the ranking constraint to
further fine-tune the network, which yields higher performance for the final
model.

Another point worth noting is that the improvement brought by step 3 is not
as large as that brought by step 2. On the one hand, this illustrates the
effectiveness of posing multimodal matching as a classification problem. On the
other hand, considering the effectiveness of the ranking loss in previous
works, there could be room for improvement in our network, especially given
its weaker R@5 and R@10.
Table 1. Bidirectional image and text retrieval results on Flickr30K.


              | Img2Txt                     | Txt2Img
Methods       | R@1  | R@5  | R@10 | Med r | R@1  | R@5  | R@10 | Med r
DCCA [13]     | 16.7 | 39.3 | 52.9 | 8     | 12.6 | 31.0 | 43.0 | 15
m-CNN [9]     | 33.6 | 64.1 | 74.9 | 3     | 26.2 | 56.3 | 69.6 |
m-RNN [16]    | 35.4 | 63.8 | 73.7 | 3     | 22.8 | 50.7 | 63.1 | 5
2-branch [15] | 40.3 | 68.9 | 79.9 | -     | 29.7 | 60.1 | 72.1 | -
sm-LSTM [17]  | 42.5 | 71.9 | 81.5 | 2     | 30.2 | 60.4 | 72.3 | 3
RBF-Net [20]  | 47.6 | 77.4 | 87.1 | -     | 35.4 | 68.3 | 79.9 | -
Ours (step 1) | 38.4 | 68.4 | 79.3 | 2     | 28.4 | 56.1 | 68.2 | 4
Ours (step 2) | 46.8 | 75.7 | 85.6 | 2     | 33.5 | 63.0 | 74.9 | 3
Ours (step 3) | 48.4 | 77.2 | 85.9 | 2     | 34.7 | 64.9 | 76.4 | 3


Ranking loss requires a careful triplet sampling strategy from the extremely 
unbalanced positive and negative ones, which points out the direction of our 
future work. 


5 Conclusion 


This paper mainly addresses the issue of multimodal matching via a novel two- 
stream convolutional neural network. The proposed network can extract the 
features from the raw image and text. To ensure that the features are shared
across modalities, a classifier and a ranking constraint are adopted for
network learning by utilizing the pair information. Experimental results on
the Flickr30k dataset demonstrate the effectiveness of viewing each multimodal
document as a discrete class. In future research, the ranking constraint will
be refined to perform more effective metric learning. Also, more detailed
experiments on the Microsoft COCO dataset will be conducted to further
validate our network.
Abstract. Graph kernels have been successfully applied to many graph
classification problems. Typically, a kernel is first designed, and then 
an SVM classifier is trained based on the features defined implicitly by 
this kernel. This two-stage approach decouples data representation from 
learning, which is suboptimal. On the other hand, Convolutional Neu- 
ral Networks (CNNs) have the capability to learn their own features 
directly from the raw data during training. Unfortunately, they cannot 
handle irregular data such as graphs. We address this challenge by using 
graph kernels to embed meaningful local neighborhoods of the graphs in 
a continuous vector space. A set of filters is then convolved with these 
patches, pooled, and the output is then passed to a feedforward network. 
With limited parameter tuning, our approach outperforms strong base- 
lines on 7 out of 10 benchmark datasets. Code and data are publicly 
available (https://github.com/giannisnik/cnn-graph-classification).


1 Introduction 


Graphs are powerful structures that can be used to model almost any kind 
of data. Social networks, textual documents, the World Wide Web, chemical 
compounds, and protein-protein interaction networks, are all examples of data 
that are commonly represented as graphs. As such, graph classification is a very 
important task, with numerous significant real-world applications. However, due 
to the absence of a unified, standard vector representation of graphs, graph 
classification cannot be tackled with classical machine learning algorithms. 

Kernel methods offer a solution to those cases where instances cannot be 
readily vectorized. The trick is to define a suitable object-object similarity func- 
tion (known as a kernel function). Then, the matrix of pairwise similarities can 
be passed to a kernel-based supervised algorithm such as the Support Vector 
Machine to perform classification. With properly crafted kernels, this two-step 
approach was shown to give state-of-the-art results on many datasets [12], and 
has become standard and widely used. One major limitation of the graph kernel 
+ SVM approach, though, is that representation and learning are two indepen- 
dent steps. In other words, the features are precomputed in separation from the 
training phase, and are not optimized for the downstream task. 
Conversely, Convolutional Neural Networks (CNNs) learn their own features
from the raw data during training, to maximize performance on the task at 
hand. CNNs thus provide a very attractive alternative to the aforementioned 
two-step approach. However, CNNs are designed to work on regular grids, and 
thus cannot process graphs. 

We propose to address this challenge by extracting patches from each input 
graph via community detection, and by embedding these patches with graph 
kernels. The patch vectors are then convolved with the filters of a 1D CNN and 
pooling is applied. Finally, to perform graph classification, a fully-connected layer 
with a softmax completes the architecture. We compare our proposed method 
with state-of-the-art graph kernels and a recently introduced neural architecture 
on 10 bioinformatics and social network datasets. Results show that our Kernel 
CNN model is very competitive, and offers in many cases significant accuracy 
gains. 


2 Related Work 


Graph Kernels. A graph kernel is a kernel function defined on pairs of graphs. 
Graph kernels can be viewed as graph similarity functions, and currently serve 
as the dominant tool for graph classification. Most graph kernels compute the 
similarity between two networks by comparing their substructures, which can 
be specific subgraphs [13], random walks [16], cycles [6], or paths [2], among 
others. The Weisfeiler-Lehman framework operates on top of existing kernels 
and improves their performance by using a relabeling procedure based on the 
Weisfeiler-Lehman test of isomorphism [12]. Recently, two other frameworks were 
presented for deriving variants of popular graph kernels [18,19]. Inspired by 
recent advances in NLP, they offer a way to take into account substructure
similarity. Some graph kernels that are not restricted to comparing
substructures of graphs, but also capture their global properties, have also
been proposed. Examples include graph kernels based on the Lovász number and
the corresponding orthonormal representation [7], the pyramid match graph
kernel that embeds vertices in a feature space and computes an approximate
correspondence between them [11], and the Multiscale Laplacian graph kernel,
which captures similarity at different granularity levels by considering a
hierarchy of nested subgraphs [9].


Graph CNNs. Extending CNNs to graphs has experienced a surge of interest
in recent years. A first class of methods uses spectral properties of graphs. An
early generalization of the convolution operator to graphs was based on the
eigenvectors of the Laplacian matrix [3]. A more efficient model using a
Chebyshev polynomial approximation to represent the spectral filters was later
presented [4]. All of these methods, however, assume a fixed graph structure and
are thus not applicable to our setting. The model of [4] was then simplified by
using a first-order approximation of the spectral filters [8], but within the
context of a node classification problem (which, again, differs from our graph
classification setting). Unlike spectral methods, spatial methods [10, 15]
operate directly on the topology of the graph. Finally, some other techniques
rely on node embeddings obtained as an unsupervised pre-processing step, like
[14], in which graphs are represented as stacks of bivariate histograms and
passed to a classical 2D CNN for images.


Fig. 1. Overview of our Kernel Graph CNN approach.

The work closest to ours is probably [10]. To extract a set of patches from the 
input graph, the authors (1) construct an ordered sequence of vertices from the 
graph, (2) create a neighborhood graph of constant size for each selected vertex, 
and (3) generate a vector representation (patch) for each neighborhood using 
graph labeling procedures such that nodes with similar structural roles in the 
neighborhood graph are positioned similarly in the vector space. The extracted 
patches are then fed to a 1D CNN. In contrast to the above work, we extract 
neighborhoods of varying sizes from the graph in a more direct and natural way 
(via community detection), and use graph kernels to normalize our patches. We 
present our approach in more detail in the next section.


3 Proposed Approach 


In what follows, we present the main ideas and building blocks of our model. 
The overarching process flow is illustrated in Fig. 1. 


3.1 Patch Extraction and Normalization 


Many types of real-world data are regular grids, and can thus be decomposed 
into units that are inherently ordered along spatial dimensions. This makes the 
task of patch extraction easy, and normalization unnecessary. For example, in 
computer vision (2D), meaningful patches are given by sliding a rectangular
window over the image. Furthermore, for all images, pixels are uniquely ordered
along width and height, so there is a correspondence between the pixels in each
patch, given by the spatial coordinates of the pixels. This removes the need for
normalization. Likewise, in NLP, words in sentences are uniquely ordered from
left to right, and a 1D window applied over text again provides natural regions.
However, graphs do not exhibit such an underlying grid-like structure. They are
irregular objects for which there exists no canonical ordering of the elementary
units (nodes). Hence, generating patches from graphs, and normalizing them so 
that they are comparable and combinable, is a very challenging problem. To 
address these challenges, our approach leverages community detection and graph 
kernels. 


Patch Extraction with Community Detection. There is a large variety of 
approaches for sampling from graphs. We can extract subgraphs for all vertices 
(which may be computationally intractable for large graphs) or for only a subset 
of them, such as the most central ones according to some metric. Furthermore, 
subgraphs may contain only the hop-1 neighborhood of a root vertex, or vertices 
that are further away from it. They may also be walks passing through the root 
vertex. A more natural way is to capitalize on community detection algorithms 
[5], as the clusters correspond to meaningful graph partitions. Indeed, a commu- 
nity typically corresponds to a set of vertices that highly interact with each other, 
as expressed by the number and weight of the edges between them, compared to 
the other vertices in the graph. In this paper, we employ the Louvain clustering 
algorithm, which extracts non-overlapping communities of various sizes from a 
given graph [1]. This multilevel algorithm aggregates each node with one of its 
neighbors such that the gain in modularity is maximized. Then, the groupings 
obtained at the first step are turned into nodes, yielding a new graph. The pro- 
cess iterates until a peak in modularity is attained and no more change occurs. 
Note that since our goal here is only to sample relevant local neighborhoods from 
the graph, we could have used any other state-of-the-art community detection 
algorithm. We opted for Louvain as it is very fast and scalable. 
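
To make the quantity that Louvain maximizes concrete, the sketch below computes the modularity of a given partition of an unweighted, undirected graph, using the standard definition (per community: the fraction of edges inside it minus the squared fraction of edge endpoints attached to it). It does not implement the Louvain search itself, and the function name `modularity` and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph.

    edges:        list of (u, v) pairs, each undirected edge listed once
    community_of: dict mapping every node to its community id
    """
    m = len(edges)
    internal = defaultdict(int)   # number of edges inside each community
    degree = defaultdict(int)     # sum of node degrees per community
    for u, v in edges:
        degree[community_of[u]] += 1
        degree[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {node: "a" for node in range(6)}

q_split = modularity(edges, two_triangles)   # 5/14
q_merged = modularity(edges, one_block)      # 0.0
```

Splitting the toy graph into its two triangles scores higher than keeping everything in one community, which is the kind of move a modularity-maximizing method prefers; each resulting community would serve as one patch.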


Patch Normalization with Graph Kernels. After extracting the subgraphs 
(communities) from a given input graph, standardization is necessary before 
being able to pass them to a CNN. We can define this step as that of patch 
normalization. To this purpose, we leverage graph kernels, as described next. 
Note that since the steps below do not depend on the way the subgraphs were 
obtained, we use the term subgraph (or patch) rather than community in what 
follows, to highlight the generality of our approach. 

Let $\mathcal{G} = \{G_1, G_2, \ldots, G_N\}$ be the collection of input graphs. Let
$S_1, S_2, \ldots, S_N$ be the sets of subgraphs extracted from graphs
$G_1, G_2, \ldots, G_N$ respectively. Since the number of subgraphs extracted
from each graph may depend on the graph (like in our case with the Louvain
community detection algorithm), these sets vary in size.

Furthermore, let $S_i^j$ be the $j$-th element of $S_i$ (i.e., the $j$-th
subgraph extracted from $G_i$), and $P_i$ be the size of $S_i$ (i.e., the total
number of subgraphs extracted from $G_i$). Let then
$\mathcal{S} = \{S_i^j : i \in \{1, 2, \ldots, N\}, j \in \{1, 2, \ldots, P_i\}\}$
be the set of subgraphs extracted from all the graphs in the collection, and
$P$ its cardinality. Let finally $K \in \mathbb{R}^{P \times P}$ be the symmetric
positive semidefinite kernel matrix constructed from $\mathcal{S}$ using a graph
kernel $k$. Since the total number $P$ of subgraphs for all the graphs in the
collection is very large, populating the full kernel matrix $K$ and factorizing
it to obtain low-dimensional representations of the subgraphs costs $O(P^3)$.
Fortunately, the Nyström method [17] allows us to obtain
$Q \in \mathbb{R}^{P \times p}$ (with $p \ll P$) such that $K \approx QQ^{\top}$
at the reduced cost of $O(p^2 P)$, by using only a small subset of $p$ columns
(or rows) of the kernel matrix. The rows of $Q$ are low-dimensional
representations of the subgraphs and serve as our normalized patches.
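
For readers unfamiliar with the Nyström method, one standard way to build such a $Q$ is sketched below. The symbols $C$, $W$, $U$, and $\Lambda$ are introduced here for illustration and do not appear in the original text; the construction assumes the sampled block $W$ has strictly positive eigenvalues (otherwise only the positive ones are kept).

$$C \in \mathbb{R}^{P \times p} \ \text{(the $p$ sampled columns of $K$)}, \qquad W \in \mathbb{R}^{p \times p} \ \text{(the rows of $C$ at the sampled indices)}, \qquad W = U \Lambda U^{\top}$$

$$Q = C\, U \Lambda^{-1/2} \quad \Longrightarrow \quad Q Q^{\top} = C\, U \Lambda^{-1} U^{\top} C^{\top} = C\, W^{-1} C^{\top} \approx K$$

Only the $P \times p$ block $C$ of kernel values has to be evaluated. The eigendecomposition of $W$ costs $O(p^3)$ and forming $Q$ costs $O(p^2 P)$, which matches the reduced cost quoted above.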


3.2 Graph Processing 


1D Convolution. To process a given input graph, many filters are convolved
with the normalized representations of the patches contained in the graph. For
example, for a given filter $w \in \mathbb{R}^{p}$, a feature $c_j$ is generated
from the $j$-th patch $z_i^j$ of graph $G_i$ as:

$$c_j = \sigma(w^{\top} z_i^j)$$

where $\sigma$ is an activation function. In this study, we used the identity
function $\sigma(c) = c$, as we observed no difference in results compared to
nonlinear activations. Therefore, when applied to a patch $z_i^j$, the
convolution operation corresponds to the inner product $\langle w, z_i^j \rangle$.
We will show next that any filter $w$ with $\|w\| < \infty$ learned by our
network belongs to the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ of
the employed graph kernel $k$.


Theorem 1. The filters live in the RKHS of the kernel k that was used to 
normalize the patches. 


Proof. Given two subgraphs $S_i^j$ and $S_{i'}^{j'}$ extracted from $G_i$ and
$G_{i'}$, and their associated normalized patches $z_i^j$ and $z_{i'}^{j'}$, it
holds that:

$$\langle z_i^j, z_{i'}^{j'} \rangle = k(S_i^j, S_{i'}^{j'}) = \langle \phi(S_i^j), \phi(S_{i'}^{j'}) \rangle_{\mathcal{H}}$$

Let $Z = \{z_i^j : i \in \{1, 2, \ldots, N\}, j \in \{1, 2, \ldots, P_i\}\}$ be the
set containing all patches of the input graphs. Then, $\mathrm{Span}(Z)$ is
either the space of all vectors in $\mathbb{R}^{P}$ if the rank of the kernel
matrix is $P$, or the space of all vectors in $\mathbb{R}^{P}$ whose last $t$
components are zero if the rank of the kernel matrix is $P - t$ where $t > 0$.
Then, given a patch $z_i^j$, vector $w$ is contained in $\mathrm{Span}(Z)$, hence:

$$\sigma(w^{\top} z_i^j) = \langle w, z_i^j \rangle = \Big\langle \sum_{i'=1}^{N} \sum_{j'=1}^{P_{i'}} a_{i'}^{j'} z_{i'}^{j'},\; z_i^j \Big\rangle = \sum_{i'=1}^{N} \sum_{j'=1}^{P_{i'}} a_{i'}^{j'} \langle z_{i'}^{j'}, z_i^j \rangle = \sum_{i'=1}^{N} \sum_{j'=1}^{P_{i'}} a_{i'}^{j'}\, k(S_{i'}^{j'}, S_i^j)$$

which shows that the filters live in the RKHS associated to graph kernel $k$. For
other smooth activation functions, one can also show that the filters will be
contained in the corresponding RKHS of the kernel function [20].


Kernel Graph Convolutional Neural Networks 27 


Note that the proposed approach can be thought of as a CNN that works directly 
on graphs. In computer vision, convolution corresponds to the element-wise mul- 
tiplication between part of an image and a filter followed by summation. Con- 
volution can thus be viewed as an inner-product where the output is a single 
feature. In our setting, convolution corresponds to the inner-product between 
part of a graph (i.e. a patch) and a filter (i.e. a graph). Such an inner-product is 
implicitly computed using a graph kernel, and the output is also a single feature. 

By convolving $w$ with all the normalized patches of the graph, the following
feature map is produced:

$$c = [c_1, c_2, \ldots, c_{P_{max}}]^{\top}$$

where $P_{max} = \max(P_i : i \in \{1, 2, \ldots, N\})$ is the largest number of
patches extracted from any given graph in the collection. For graphs featuring
fewer than $P_{max}$ patches, zero-padding is employed.

Note that this approach is similar to concatenating all the vector
representations of the patches contained in a given graph (padding if
necessary), thus obtaining a single vector representation of the graph, and
sliding over it a one-dimensional filter whose size equals the length of a
single patch vector, without spanning multiple patches (i.e., with stride equal
to filter size).


Pooling. We then apply a max-pooling operation over the feature map, thus
retaining only the maximum value of $c$, $\max(c_1, c_2, \ldots, c_{P_{max}})$,
as the signal associated with $w$. The intuition is that some subgraphs of a
graph are good indicators of the class the graph belongs to, and that this
information will be picked up by the max-pooling operation.
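
The convolution and pooling steps for a single filter can be sketched as follows, assuming the normalized patches of a graph are given as lists of floats and using the identity activation. The function names `feature_map` and `graph_signal` are hypothetical.

```python
def feature_map(patches, w, p_max):
    """Inner product of filter w with each patch, zero-padded to length p_max.

    A zero-padded patch yields w . 0 = 0 under the identity activation, so
    padding the patches is equivalent to padding the feature map with zeros.
    """
    features = [sum(wi * zi for wi, zi in zip(w, z)) for z in patches]
    return features + [0.0] * (p_max - len(features))


def graph_signal(patches, w, p_max):
    """Max-pooling over the feature map: one scalar per (graph, filter) pair."""
    return max(feature_map(patches, w, p_max))


w = [1.0, 1.0]
patches = [[1.0, 0.0], [0.0, 2.0]]          # a graph with two patches, p = 2
c = feature_map(patches, w, p_max=4)         # [1.0, 2.0, 0.0, 0.0]
signal = graph_signal(patches, w, p_max=4)   # 2.0
```

One consequence of zero-padding with an identity activation is that a graph with fewer than $P_{max}$ patches can never produce a pooled value below zero for any filter.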


3.3 Processing New Graphs 


When provided with a never-seen graph (at test time), we first sample subgraphs
from it (here, via community detection), and then project them to the feature
space of the subgraphs in the training set. Given a new subgraph $S_i^j$, its
projection can be computed as $z_i^j = Q^{\dagger} v$, where
$Q^{\dagger} \in \mathbb{R}^{p \times P}$ is the pseudoinverse of
$Q \in \mathbb{R}^{P \times p}$ and $v \in \mathbb{R}^{P}$ is the vector
containing the kernel values between $S_i^j$ and all $P$ subgraphs in the
training set (those contained in set $\mathcal{S}$). The dimensionality $p$ of
the resulting vector is the same as that of the normalized patches in the
training set. Thus, this vector can be convolved with the filters of the CNN
as previously described.
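
A short consistency check shows why the pseudoinverse gives a sensible projection. The index $r$ and the symbols $z_r$, $v_r$ are introduced here for illustration only, and the argument assumes $Q$ has full column rank. For a training subgraph with index $r$, let $z_r$ be the corresponding row of $Q$ (written as a column vector) and $v_r$ the corresponding column of $QQ^{\top}$:

$$v_r = Q z_r \quad \Longrightarrow \quad Q^{\dagger} v_r = Q^{\dagger} Q\, z_r = z_r, \qquad \text{since } Q^{\dagger} Q = I_p .$$

Applied to the approximated kernel values, the projection therefore returns exactly the normalized patch of a training subgraph, and the same linear map extends to unseen subgraphs through their kernel values $v$.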


3.4 Channels 


Rather than selecting one graph kernel in particular to normalize the patches, 
several kernels can be jointly used. The different representations provided by 
each kernel can then be passed to the CNN through different channels, or depth 
dimensions. Intuitively, this can be very beneficial, as each kernel might capture 
different, complementary aspects of similarity between subgraphs. We experi- 
mented with the following popular kernels: 
• Shortest path kernel (SP) [2]: to compute the similarity between two
graphs, this kernel counts how many pairs of shortest paths have the same
source and sink labels, and identical length, in the two graphs. The runtime
complexity for a pair of graphs featuring $n_1$ and $n_2$ nodes is
$O(n_1^2 n_2^2)$.

• Weisfeiler-Lehman subtree kernel (WL) [12]: for a certain number h of
iterations, this kernel performs an exact matching between the compressed
multiset labels of the two graphs, while at each iteration it updates these
labels. It requires $O(hm)$ time for a pair of graphs with m edges.


This gave us two single channel models (KCNN SP, KCNN WL), and one 
model with two channels (KCNN SP + WL). 
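To make the shortest path kernel concrete, the following minimal sketch counts matching shortest paths between two small graphs. It assumes undirected, unweighted graphs given as adjacency lists with integer vertex identifiers, and it counts each unordered vertex pair once; the function names and the two toy graphs are hypothetical and are not taken from the experiments.

```python
from collections import Counter, deque


def shortest_path_features(adj, labels):
    """Count (endpoint label, endpoint label, length) triples over all shortest paths."""
    feats = Counter()
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:                      # breadth-first search from src
            u = queue.popleft()
            for nb in adj[u]:
                if nb not in dist:
                    dist[nb] = dist[u] + 1
                    queue.append(nb)
        for dst, d in dist.items():
            if src < dst:                 # count each unordered pair once
                key = tuple(sorted((labels[src], labels[dst]))) + (d,)
                feats[key] += 1
    return feats


def sp_kernel(g1, g2):
    """Number of shortest-path pairs with equal endpoint labels and equal length."""
    f1 = shortest_path_features(*g1)
    f2 = shortest_path_features(*g2)
    return sum(count * f2[key] for key, count in f1.items())


# Two toy graphs with a single vertex label "a".
path3 = ({0: [1], 1: [0, 2], 2: [1]}, {0: "a", 1: "a", 2: "a"})
triangle = ({0: [1, 2], 1: [0, 2], 2: [0, 1]}, {0: "a", 1: "a", 2: "a"})

k_cross = sp_kernel(path3, triangle)
```

The 3-vertex path has two shortest paths of length 1 and one of length 2, while the triangle has three of length 1, so only the length-1 paths match.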


4 Experimental Setup 


4.1 Synthetic Dataset 


Dataset. As previously mentioned, the intuition is that our proposed KCNN 
model is particularly well suited for settings where some regions in the graphs are 
highly discriminative of the class the graph belongs to. To empirically verify this 
claim, we created a dataset featuring 1000 synthetic graphs generated as follows. 
First, we generate an Erdős-Rényi graph with number of vertices sampled from
$\mathbb{Z} \cap [100, 200]$ with uniform probability, and edge probability equal to 0.1. We
then add to the graph either a 10-clique or a 10-star graph by connecting the 
vertices with probability 0.1. The first class of the dataset is made of the graphs 
containing a 10-clique, while the second class features the graphs containing a 
10-star subgraph. The two classes are of equal size (500 graphs each). 
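The generation procedure can be sketched as follows. This is one possible reading of the description, not the authors' script: a 10-star is assumed to be one hub plus nine leaves, the planted motif is assumed to be attached by connecting each (background vertex, motif vertex) pair independently with probability 0.1, and the function names are hypothetical. Only five graphs per class are generated here to keep the example fast; the text uses 500.

```python
import random


def erdos_renyi(n, p, rng):
    """Edge set of a G(n, p) random graph on vertices 0..n-1."""
    return {(u, v) for u in range(n) for v in range(u + 1, n) if rng.random() < p}


def make_graph(motif, rng, k=10, p=0.1):
    n = rng.randint(100, 200)                 # uniform over the integers in [100, 200]
    edges = erdos_renyi(n, p, rng)
    new = list(range(n, n + k))               # vertices of the planted motif
    if motif == "clique":
        edges |= {(u, v) for i, u in enumerate(new) for v in new[i + 1:]}
    else:                                     # star: new[0] is the hub (assumption)
        edges |= {(new[0], v) for v in new[1:]}
    # Attach the motif to the background graph with probability p per vertex pair.
    edges |= {(u, v) for u in range(n) for v in new if rng.random() < p}
    return n + k, edges


def make_dataset(n_per_class, rng):
    data = []
    for label, motif in enumerate(("clique", "star")):
        for _ in range(n_per_class):
            data.append((make_graph(motif, rng), label))
    return data


rng = random.Random(0)
dataset = make_dataset(5, rng)                # the text uses 500 graphs per class
```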


Baselines. We compared our model against the shortest-path kernel (SP) 
[2], the Weisfeiler-Lehman subtree kernel (WL) [12], and the graphlet kernel 
(GR) [13]. 


Configuration. We performed 10-fold cross-validation. The C parameter of 
the SVM (for all graph kernels) and the number of iterations (for the WL kernel 
baseline) were optimized on a 90/10 split of the training set of each fold. For the 
graphlet kernel, we sampled 1000 graphlets of size up to 6 from each graph. For 
our proposed KCNN, we used an architecture with one convolution-pooling block 
followed by a fully connected layer with 128 units. The ReLU activation was used, 
and regularization was ensured with dropout (0.5 rate). A final softmax layer 
was added to complete the architecture. The dimensionality of the normalized 
patches (number of columns of Q) was set to p = 100, and we used 256 filters (of 
size p, as explained in Subsect. 3.2). Batch size was set to 64, and the number of 
epochs and learning rate were optimized by performing 10-fold cross-validation 
on the training set of each fold. All experiments were run on a single machine
with a 3.4 GHz Intel Core i7 CPU, 16 GB of RAM, and an NVIDIA GeForce Titan Xp
GPU.


Results. We report in Table 1 the average prediction accuracies of our three
models in comparison to the baselines. The results validate the hypothesis that
our proposed model (KCNN) can identify those areas in the graphs that are most
predictive of the class labels, as its three variants achieved accuracies above
97%. Conversely, the baseline kernels largely failed to discriminate between the
two categories, with accuracies between 65% and 76%. Hence, in such settings, our
model is more effective than the baseline kernels.

Table 1. Classification accuracy (%) of state-of-the-art graph kernels: shortest
path (SP), graphlet (GR), and Weisfeiler-Lehman subtree (WL); and the single and
multichannel variants of our approach (KCNN), on the synthetic dataset.

SP    | GR    | WL    | KCNN SP | KCNN WL | KCNN SP + WL
75.47 | 69.34 | 65.88 | 98.20   | 97.25   | 98.40


4.2 Real-World Datasets 


Datasets. We also evaluated the performance of our approach on five
bioinformatics (ENZYMES, NCI1, PROTEINS, PTC-MR, D&D) and five social network
datasets (IMDB-BINARY, IMDB-MULTI, REDDIT-BINARY, REDDIT-MULTI-5K, COLLAB)¹.
Notice that the bioinformatics datasets are labeled (labels on vertices), while
the social interaction datasets are not.


Baselines. We evaluated our model in comparison with the shortest-path kernel 
(SP) [2], the random walk kernel (RW) [16], the graphlet kernel (GR) [13], the 
Weisfeiler-Lehman subtree kernel (WL) [12], the best kernel from the deep graph 
kernel framework (Deep Graph Kernels) [19], and a recently proposed graph 
CNN (PSCN k = 10) [10]. Since the experimental setup is the same, we report 
the results of [19] and [10]. 


Configuration. Same as Subsect. 4.1 above. 


Results. The 10-fold cross-validation average test set accuracy of our approach
and the baselines is reported in Table 2. Our approach outperforms all baselines
on 7 out of the 10 datasets. In some cases, the gains in accuracy are
considerable. For instance, on the IMDB-MULTI, COLLAB, and D&D datasets, we offer
respective absolute improvements of 2.23, 2.33, and 2.56 percentage points in
accuracy over the state-of-the-art graph CNN (PSCN k = 10). Finally, it should be
noted that on the IMDB-MULTI dataset, every variant of our architecture
outperforms all baselines.
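The three quoted improvements can be reproduced from the accuracies in Table 2. In the short check below, the figures are copied from the table, the best KCNN variant is used for each dataset, and the differences are taken against the PSCN k = 10 row; the variable names are illustrative only.

```python
# Accuracies (%) copied from Table 2.
kcnn_best = {"IMDB-MULTI": 47.46, "COLLAB": 74.93, "D&D": 78.83}
pscn = {"IMDB-MULTI": 45.23, "COLLAB": 72.60, "D&D": 76.27}

# Absolute gains in percentage points, rounded to two decimals.
gains = {name: round(kcnn_best[name] - pscn[name], 2) for name in kcnn_best}
```

These are differences in percentage points between two accuracies, not relative gains.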

Interpretation. Overall, our Kernel CNN model reaches better performance
than the classical graph kernels (SP, GR, RW, and WL), showing that the ability
of CNNs to learn their own features during training is superior to disjoint feature
computation and learning. It is true that our approach also comprises two disjoint
steps. However, the first step is only a data preprocessing step, where we extract
neighborhoods from the graphs, and normalize them with graph kernels. The
features used for classification are then learned during training by our neural
architecture, unlike the GK + SVM approach, where the features, given by the
kernel matrix, are computed in advance, independently from the downstream
task.

¹ The datasets, further references and statistics are available at
https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

Table 2. 10-fold cross validation average classification accuracy (± standard deviation)
of the proposed models and the baselines on the bioinformatics (top) and social network
(bottom) datasets. Best performance per dataset in bold, among the variants of our
Kernel CNN model underlined.

Method       | ENZYMES        | NCI1           | PROTEINS       | PTC-MR         | D&D
SP           | 40.10 (± 1.50) | 73.00 (± 0.51) | 75.07 (± 0.54) | 58.24 (± 2.44) | >3 days
GR           | 26.61 (± 0.99) | 62.28 (± 0.29) | 71.67 (± 0.55) | 57.26 (± 1.41) | 78.45 (± 0.26)
RW           | 24.16 (± 1.64) | >3 days        | 74.22 (± 0.42) | 57.85 (± 1.30) | >3 days
WL           | 53.15 (± 1.14) | 80.13 (± 0.50) | 72.92 (± 0.56) | 56.97 (± 2.01) | 77.95 (± 0.70)
Deep Kernels | 53.43 (± 0.91) | 80.31 (± 0.46) | 75.68 (± 0.54) | 60.08 (± 2.55) | NA
PSCN k = 10  | NA             | 76.34 (± 1.68) | 75.00 (± 2.51) | 62.29 (± 5.68) | 76.27 (± 2.64)
KCNN SP      | 46.35 (± 0.39) | 75.70 (± 0.31) | 74.27 (± 0.22) | 62.94 (± 1.69) | 76.63 (± 0.09)
KCNN WL      | 43.08 (± 0.68) | 75.83 (± 0.25) | 75.76 (± 0.28) | 61.52 (± 1.41) | 75.80 (± 0.07)
KCNN SP + WL | 48.12 (± 0.23) | 77.21 (± 0.22) | 73.79 (± 0.29) | 62.05 (± 1.41) | 78.83 (± 0.29)

Method       | IMDB-BINARY    | IMDB-MULTI     | REDDIT-BINARY  | REDDIT-MULTI-5K | COLLAB
GR           | 65.87 (± 0.98) | 43.89 (± 0.38) | 77.34 (± 0.18) | 41.01 (± 0.17)  | 72.84 (± 0.28)
Deep GR      | 66.96 (± 0.56) | 44.55 (± 0.52) | 78.04 (± 0.39) | 41.27 (± 0.18)  | 73.09 (± 0.25)
PSCN k = 10  | 71.00 (± 2.29) | 45.23 (± 2.84) | 86.30 (± 1.58) | 49.10 (± 0.70)  | 72.60 (± 2.15)
KCNN SP      | 69.60 (± 0.44) | 45.99 (± 0.23) | 77.23 (± 0.15) | 44.86 (± 0.24)  | 70.78 (± 0.12)
KCNN WL      | 70.46 (± 0.45) | 46.44 (± 0.24) | 81.85 (± 0.12) | 50.04 (± 0.19)  | 74.93 (± 0.14)
KCNN SP + WL | 71.45 (± 0.15) | 47.46 (± 0.21) | 78.35 (± 0.11) | 44.63 (± 0.18)  | 74.12 (± 0.17)

Our two single-channel architectures perform comparably on the bioinformatics
datasets, while the KCNN WL variant was superior on the social network datasets.
On the REDDIT-BINARY, REDDIT-MULTI-5K and COLLAB datasets, KCNN WL also
outperforms the multichannel architecture, with quite wide margins. The
multichannel architecture (KCNN SP + WL) leads to better results than both
single-channel variants on 5 out of the 10 datasets, showing that capturing
subgraph similarity from a variety of angles sometimes helps.


Table 3. 10-fold cross validation runtime of proposed models on the 10 real-world
graph classification datasets.

Dataset         | KCNN SP | KCNN WL | KCNN SP + WL
ENZYMES         | 28″     | 53″     | 1′ 13″
NCI1            | 4′ 26″  | 4′ 54″  | 5′ 1″
PROTEINS        | 42″     | 48″     | 53″
PTC-MR          | 22″     | 22″     | 25″
D&D             | 54″     | 1′ 33″  | 1′ 46″
IMDB-BINARY     | 36″     | 41″     | 45″
IMDB-MULTI      | 1′ 41″  | 58″     | 1′ 44″
REDDIT-BINARY   | 5′ 29″  | 5′ 22″  | 9′ 57″
REDDIT-MULTI-5K | 15′ 2″  | 14′ 23″ | 24′ 28″
COLLAB          | 7′ 2″   | 8′ 58″  | 10′ 24″
Runtimes. We also report the time cost of our three models in Table 3. Runtime
includes all steps of the process: patch extraction, patch normalization, and the
10-fold cross-validation procedure. The computational cost of the proposed models
is modest: our most computationally intensive model (KCNN SP + WL) takes less
than 25 min to perform the full 10-fold cross-validation procedure on the largest
dataset (REDDIT-MULTI-5K). Moreover, in most cases, the running times are lower
than or comparable to those of the state-of-the-art Graph CNN and Deep Graph
Kernels models [10,19].


5 Conclusion 


In this paper, we proposed a method that combines graph kernels with CNNs 
to learn graph representations and to perform graph classification. Our Kernel 
Graph CNN model (KCNN) outperforms 6 state-of-the-art graph kernels and 
graph CNN baselines on 7 datasets out of 10. 
Abstract. Three-phase induction motors are widely used in many applications, both
in industry and in other environments. Although this electrical machine is robust
and reliable for industrial tasks, condition monitoring techniques have been
investigated in recent years to identify electrical and mechanical faults in
induction motors. In this sense, broken rotor bars are a typical fault related to
induction machine damage, and the current technical solutions have shown some
drawbacks for diagnosing this kind of failure, particularly when the motor is
running at very low slip. Therefore, this paper proposes a new use of the
Histogram of Oriented Gradients, usually applied in computer vision and image
processing, for broken bar detection, using data from only one phase of the
stator current of the machine. The intensity gradients and edge directions of
each time-window of the stator signal are applied as inputs to a neural network
classifier. This method has been validated using experimental data from a 7.5 kW
squirrel cage induction machine running at distinct load levels (slip
conditions).


Keywords: Induction motors · Broken rotor bars · Stator current ·
Neural network classifier


1 Introduction 


During the past decades, condition monitoring techniques have been applied by
several researchers for failure detection in induction motors (IM), as well as in
predictive maintenance programs in industry. Today, induction motors drive many
loads and deliver power in a variety of energy conversion processes [1]. However,
IMs have some technical limitations, such as mechanical or electromagnetic
stresses that are usually related to damage in the stator and rotor cage [2]. For
larger machines, for example, longer downtime per failure usually occurs with
induction motors that start more than once per day, or in applications with
pulsating loads or direct on-line startups [3].

A noninvasive technique called motor current signature analysis (MCSA) has been
used over the last decades for broken bar detection, particularly due to its
noninvasive character and attractive applications in industrial environments.
However, MCSA has some drawbacks in the diagnosis of rotor failures, such as
detection at very low slip (low load or no load) and with nonadjacent broken
bars, as cited by [4-6]. The sideband frequencies (features extracted from the
stator current) used by MCSA are usually near the fundamental frequency for a
motor running at low load; thus, it is quite difficult to distinguish between a
healthy and a faulty rotor. Therefore, in many cases MCSA produces both false
positive and false negative alarms in the evaluation of broken rotor bars [4].

Other signal processing and feature extraction methods have been used for failure
diagnosis in induction motors using time and/or frequency domain data, as
described by [7-11]. In general, such works use the Fast Fourier Transform (FFT),
the Hilbert Transform (HT), ESPRIT, and Empirical Mode Decomposition (EMD) to
extract information from the stator current and other signals of an IM with
broken bars. However, most of them require a long data acquisition time and a
high frequency resolution to ensure efficient failure detection.

In addition, other studies have demonstrated the use of machine learning and
artificial intelligence approaches to detect not only broken rotor bars, but also
other types of failures in induction motors, as cited by [12-15]. A recent work
[16], for example, presents the current methods used for fault diagnosis in
rotating machinery, such as artificial neural networks, clustering algorithms,
deep learning, and hybrid techniques.

Based on the aforementioned state of the art, the present work proposes a new
approach for the diagnosis of broken rotor bars, based on the histogram of
oriented gradients (HOG) method [17] and using data from only a single phase of
the stator current. The main features of the stator current data are extracted
from the intensity gradients and edge directions and fed to a multilayer
perceptron (MLP) classifier. In addition, this paper discusses the present
approach for broken bar detection when induction motors are operating at reduced
load or low slip.


2 Theoretical Background 


An analog signal is a physical process that depends on time and can be modeled by
a real function of a real variable that represents time.

In this paper, this function models the stator current of an induction motor,
which is a sinusoidal and periodic signal of the electrical machine. The
amplitude of this signal depends on the load torque applied to the shaft of the
motor. The stator current signal can be digitized by a process called sampling,
which approximates the signal by values taken at regular time intervals.

Thus, the digital stator current signal is represented by a function
$u : D \subset \mathbb{Z} \to \mathbb{R}$, in which a sample $x \in D$ is an
integer number representing a discrete instant in a sampled time of $T_s$
seconds. In addition, this signal, which is periodic, can be divided into cycles
with a duration of $1/f$ seconds, where f is the fundamental frequency, set to
60 Hz. Thus, we consider $W = \{W_1, W_2, \ldots, W_{T_W}\}$ a partition on D
such that for any $1 \le i \le T_W$, the time-window $W_i$ contains the samples
of some complete cycles of signal u. Thus, the time-window $W_i$ contains
$W_{cycle}$ complete cycles with $a \times W_{cycle}$ samples, for a sample time
of $T_s = W_{cycle} \times \frac{1}{60} \times T_W$ seconds. Note that W is a
non-empty set, its elements are disjoint, and the union of its time-windows is D.
2.1 The Histogram of Oriented Gradients as a Feature Descriptor


The HOG is a feature descriptor introduced by Dalal and Triggs for the detection
of pedestrians in photographs [17] and later used for other object detection
problems, such as the solutions presented in [18] and [19]. The HOG is a
technique for describing the original signal u through a histogram of the
gradient direction. The gradient $\nabla(u)$ can be computed by a simple
difference scheme, as follows:


Vx € D,[V(w)](x) _ ux- o a 


The gradient direction $\theta(\nabla(u))$ at point $x \in D$ is expressed as an
angle in the interval $[0, 2\pi]$ radians and can be computed as follows:


$$\forall x \in D,\quad [\theta(\nabla(u))](x) = \tan^{-1}\big([\nabla(u)](x)\big) \qquad (2)$$


Then, each sample $x \in D$ contributes to the histogram with a value
proportional to its gradient magnitude, which can be obtained by:


$$\forall x \in D,\quad [\rho(\nabla(u))](x) = \sqrt{\big([\nabla(u)](x)\big)^2} \qquad (3)$$


The histogram is constructed for a small number $n_{bins}$ of bins corresponding
to regular intervals of gradient direction. In addition, a sample x located in
the k-th bin can contribute to two angle ranges in the histogram, according to
the distance ratio between the bin angle center $\theta_k$ and the sample angle
$[\theta(\nabla(u))](x)$. This proportion is given as follows:


$$\alpha_k(x) = \max\left(0,\; 1 - \frac{\left|[\theta(\nabla(u))](x) - \theta_k\right|}{2\pi / n_{bins}}\right) \qquad (4)$$


Therefore, we compute a histogram HOG for each time-window $W_i \in W$ as follows:


$$[HOG(u, W_i)](k) = \sum_{x \in W_i} \alpha_k(x)\,\rho(x), \quad \text{for } k = 1, 2, \ldots, n_{bins} \qquad (5)$$


where $\alpha_k(x)$ is defined in Eq. (4) and $\rho(x)$ is defined in Eq. (3).
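As an illustration of how such a descriptor might be computed for one time-window, the sketch below follows the structure of Eqs. (2), (3) and (5). Several choices are assumptions rather than details confirmed by the text: the gradient is taken as a central difference, the bin centres are placed every 5° over [−90°, +90°] (37 bins, matching the setting reported later in the paper), and the soft-assignment weight decreases linearly with the distance to the bin centre. The function name `hog_1d` and the test signal are hypothetical.

```python
import math


def hog_1d(window, n_bins=37):
    """Histogram of oriented gradients for one time-window of a 1-D signal."""
    width = math.pi / (n_bins - 1)                    # spacing between bin centres
    centres = [-math.pi / 2 + k * width for k in range(n_bins)]
    hist = [0.0] * n_bins
    for x in range(1, len(window) - 1):
        grad = (window[x + 1] - window[x - 1]) / 2.0  # central difference (assumed)
        theta = math.atan(grad)                       # gradient direction, Eq. (2)
        rho = math.sqrt(grad ** 2)                    # gradient magnitude, Eq. (3)
        for k, centre in enumerate(centres):
            # Linear soft assignment to the nearest bin centres (assumed form).
            alpha = max(0.0, 1.0 - abs(theta - centre) / width)
            hist[k] += alpha * rho                    # accumulation, Eq. (5)
    return hist


# One 60 Hz cycle sampled at 10 kHz (167 samples), unit amplitude.
cycle = [math.sin(2 * math.pi * 60 * n / 10_000) for n in range(167)]
descriptor = hog_1d(cycle)
```

With this weighting each sample splits its magnitude between at most two neighbouring bins, so the histogram entries sum to the total gradient magnitude of the window.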


3 The HOG-MLP Method for Broken Bars Detection 


The proposed method is based on a divide-and-conquer approach. The idea is to
divide the problem into sub-problems and then combine the sub-problem solutions
into a solution to the original problem. Here, the original problem is to
classify broken rotor bars from the stator current signal. We therefore divide
the stator current signal into time-windows given by the partition W. Each
time-window $W_i$ is then classified by a multilayer perceptron. Finally, we
combine the MLP results into a single classification through a Bayesian
classifier.
The proposed method for the diagnosis of broken rotor bars consists of six stages
(see Fig. 1): (i) acquisition of the stator current signal; (ii) signal
simplification; (iii) signal segmentation in cycles; (iv) feature extraction;
(v) classification of time-windows; and (vi) fault detection.


Fig. 1. Schematic view of broken bar detection using HOG and MLP. Blocks:
induction motor, current sensor (CT), stator current acquisition, signal
simplification (filtering), signal segmentation in cycles, feature extraction
(HOG), classification of signal cycles (ANN), fault detection.


Acquisition of Stator Current Signal: A table representing the function
$u : D \subset \mathbb{Z} \to \mathbb{R}$ is constructed from the stator current
data. These data were collected from the motor running at four distinct load
torque conditions, i.e., with the braking system supplied with 40 V
(slip = 0.66%), 50 V (slip = 0.077%), 60 V (slip = 1%) and 70 V (slip = 1.16%);
thus, the motor was running at very low slip in all cases (close to or lower than
1%). It is important to highlight that large motors usually run at low slip even
at rated load, and small motors often operate below rated load in many industrial
applications. The slip s is defined as the difference between the flux speed
$N_s$ and the rotor speed $N_r$ and is usually expressed as a percentage of the
synchronous speed ($N_s$), i.e., $s = \frac{N_s - N_r}{N_s} \times 100\%$. The
stator current was sampled over a period of 10 s (i.e., $T_s = 10$ s), with a
fundamental frequency of 60 Hz and a sampling frequency of 10 kHz.
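To give a sense of scale for these slip values, the slip definition can be inverted to obtain the rotor speed. The sketch below assumes a synchronous speed of 1800 rpm (a four-pole machine supplied at 60 Hz); this value and the function names are assumptions made for illustration.

```python
def slip_percent(n_sync_rpm, n_rotor_rpm):
    """Slip as a percentage of synchronous speed."""
    return (n_sync_rpm - n_rotor_rpm) / n_sync_rpm * 100.0


def rotor_speed(n_sync_rpm, slip_pct):
    """Rotor speed obtained by inverting the slip definition."""
    return n_sync_rpm * (1.0 - slip_pct / 100.0)


n_sync = 1800.0   # rpm; assumed synchronous speed (4 poles at 60 Hz)
speeds = {s: rotor_speed(n_sync, s) for s in (0.66, 1.0, 1.16)}
```

Under this assumption a slip of 1% corresponds to a rotor speed of about 1782 rpm, only 18 rpm below synchronous speed, which shows how lightly loaded the tested conditions are.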


Signal Simplification: After collecting the data from the motor, the stator
current was filtered to reduce noise and to ease signal processing in the time
domain. A sixth-order Butterworth low-pass filter with a cutoff frequency of
200 Hz was used, since this value was able to extract the waveform distortion
caused by the rotor failure. It is important to highlight that the distortion of
the sinusoidal wave (stator current) is greater in the presence of broken bars,
since this failure produces harmonic components with higher amplitudes (rotor
slot harmonics).


Signal Segmentation in Cycles: Since the sample time is 10 s, the fundamental
frequency is 60 Hz, and the sampling frequency is 10 kHz, each sample time has
600 cycles and each cycle contains approximately 167 samples (10,000/60). The
partition W is constructed in one of the following ways: either 600 time-windows
of a single cycle each (i.e., $W_{cycle} = 1$) or 20 time-windows of 30 cycles
each (i.e., $W_{cycle} = 30$).
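The two partitions can be reproduced with a few lines of arithmetic. The sketch below is one possible way to build the windows: because 10,000/60 is not an integer, window boundaries are rounded to the nearest sample, which is an assumption of this example rather than a detail given in the text. The function name `partition` is hypothetical.

```python
def partition(n_samples, samples_per_cycle, w_cycle):
    """Split sample indices 0..n_samples-1 into windows of w_cycle whole cycles."""
    n_cycles = round(n_samples / samples_per_cycle)
    n_windows = n_cycles // w_cycle
    bounds = [round(i * w_cycle * samples_per_cycle) for i in range(n_windows + 1)]
    bounds[-1] = n_samples                 # the last window ends with the signal
    return [range(bounds[i], bounds[i + 1]) for i in range(n_windows)]


fs, f, t_s = 10_000, 60, 10                # sampling rate (Hz), fundamental (Hz), duration (s)
samples_per_cycle = fs / f                 # about 166.67, quoted as 167 in the text

windows_1 = partition(fs * t_s, samples_per_cycle, w_cycle=1)     # 600 windows
windows_30 = partition(fs * t_s, samples_per_cycle, w_cycle=30)   # 20 windows
```

With 30 cycles per window each window holds exactly 5000 samples (0.5 s), whereas single-cycle windows alternate between 166 and 167 samples.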


Feature Extraction: The feature extraction was performed for each time-window of
partition W, thus constructing a set of feature vectors from a stator current
signal u, i.e., $k(u) = \{HOG(u, W_i) : W_i \in W\}$. From the descriptors
extracted from the stator current signals we constructed the training and
validation datasets, in which 6400 labeled examples were used for the training
dataset.
Moreover, the datasets were constructed using balanced samples, that is, both
classes contain the same number of samples. Note that we constructed one
training/validation dataset pair for each parameter setting of our approach
discussed in Sect. 4.


Classification of Time-Window: A typical MLP classifier is built using a training
set $S = \{(p_k, c_k) \in \mathbb{R}^{n_{bins}} \times \{0, 1\} : k = 1, 2, \ldots, 60 \times T_W\}$
of labeled feature vectors. The MLP maps the feature vector
$HOG(u, W_i) \in k(u)$ of a time-window $W_i \in W$ of a stator current signal u
to either a time-window of a healthy stator current signal (labeled “0”) or a
time-window of an unhealthy stator current signal (labeled “1”), i.e.,
$MLP : \mathbb{R}^{n_{bins}} \to \{0, 1\}$.

The ANN was trained with 37 input features extracted from HOG, using the 
Levenberg-Marquardt algorithm and only one hidden layer was used in its topology. 
The training error obtained for the MLP classifier is about 1.7 x 107° using the tra- 
ditional k-fold-cross-validation (with k= 10) technique to evaluate the classifier 
performance. 


Fault Detection: The last stage combines the time-window classifications for
rotor fault detection. This procedure is performed using a Bayesian classifier.
Thus, given a stator current signal u, we detect the rotor condition as follows:


$$\text{Bayesian Classifier} = \begin{cases} \text{failure}, & \text{if } P(y = 1 \mid k(u)) > P(y = 0 \mid k(u)) \\ \text{non-failure}, & \text{otherwise} \end{cases} \qquad (6)$$


where the posterior probabilities $P(y = 1 \mid k(u))$ and $P(y = 0 \mid k(u))$
are computed using the MLP classifier as follows:


P(y = ku) = Y) — 


X;¡€ k(u) 


(7) 


and $P(y = 0 \mid k(u)) = 1 - P(y = 1 \mid k(u))$. The prior probabilities
$P(y = 1)$ and $P(y = 0)$ are discussed in Sect. 4.
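The decision rule of Eq. (6) can be illustrated with a small example. The sketch below assumes that the posterior $P(y = 1 \mid k(u))$ is estimated as the fraction of time-windows that the MLP labels as unhealthy, with equal priors; this estimator is an assumption made for illustration, and the exact form of Eq. (7) may differ. The function names and the label sequence are hypothetical.

```python
def posterior_faulty(window_labels):
    """Estimate P(y = 1 | k(u)) as the fraction of windows labelled 1 (assumed form)."""
    return sum(window_labels) / len(window_labels)


def diagnose(window_labels):
    """Decision rule of Eq. (6): failure only if P(y=1|k(u)) > P(y=0|k(u))."""
    p1 = posterior_faulty(window_labels)
    p0 = 1.0 - p1
    return "failure" if p1 > p0 else "non-failure"


# MLP outputs for the 20 time-windows (W_cycle = 30) of one hypothetical signal.
labels = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
result = diagnose(labels)
```

Combining the windows in this way lets a signal be diagnosed correctly even when a few individual windows are misclassified, and a tie falls on the non-failure side, as in the "otherwise" branch of Eq. (6).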


4 Experimental Results 


As mentioned before, a current sensor (CT, current transformer) was used to
measure the stator current of a 7.5 kW squirrel cage induction motor (rated
speed = 1800 rpm). This signal was collected using a PC and a Hantek USB digital
oscilloscope, model HT6022BE, with a bandwidth of 20 MHz and a maximum real-time
sample rate of 48 MS/s. The data were collected from tests performed in the
laboratory, with the motor running at rated frequency (60 Hz) and under distinct
load levels. Figure 2 shows the experimental setup of the induction motor.
Fig. 2. (a) Experimental setup and (b) two rotor conditions (healthy rotor and
rotor with one broken bar).


For the experimental tests and rotor evaluation, the stator current data were
collected from the motor running at four distinct load torque conditions, i.e.,
with the braking system supplied with a voltage of 40 V, 50 V, 60 V and 70 V. The
stator current was sampled over a period of 10 s (i.e., $T_s = 10$ s). Thus,
considering the fundamental frequency of 60 Hz and a sampling frequency of
10 kHz, each sampled period has 600 cycles. The classification error and the
accuracy were obtained using 10-fold cross-validation, while varying some stator
signal parameters. The performance of the MLP classifier is described in more
detail below.


4.1 Analysis of Parameters for the Proposed Method 


In this subsection we present an analysis based on the receiver operating
characteristic (ROC) to find the best parameters for our approach. The
parameters studied were (1) the angle range of the HOG, i.e., the parameter
$n_{bins}$; (2) the time-window length, i.e., the parameter $W_{cycle}$; and
(3) the threshold value for the output classification.

We studied the gradient directions used in the HOG and found that the angles lie
in the interval [−90°, +90°], giving a total of 37 angles. Thus, we analyzed the
parameter $n_{bins}$ by varying the number of angles per HOG bin in [1, 5], using
ROC curves. Analogously, we analyzed the parameter $W_{cycle}$ for some
time-window lengths. Figure 3 shows the ROC curve results for a time-window of
only one cycle and for 30 cycles, respectively, according to the HOG bin angle
variation.
Fig. 3. Analysis of ROC curves to determine the best $W_{cycle}$ and $n_{bins}$
parameters: (a) ROC curve using $W_{cycle} = 1$; (b) ROC curve using
$W_{cycle} = 30$.
The ROC curves generated from a stator signal processed with 30 cycles show
better performance than those obtained for only one cycle, even for distinct HOG
angles; thus, in this paper the time-window of 30 cycles (i.e., 0.5 s) was chosen
for broken rotor bar detection using the MLP classifier. As mentioned by [20],
the closer the ROC curve is to the upper left corner, the better the classifier
performance. Using the selected parameters, a typical bin angle distribution for
a healthy motor and a damaged rotor is shown in Fig. 4. Some variation in the HOG
bin angle amplitudes can be noted between the two classes (healthy and faulty
rotor).


Fig. 4. Typical HOG for a healthy motor and a damaged rotor (x-axis: bin angle of
HOG; y-axis: amplitude of each bin angle).


4.2 Fault Detection Using HOG, MLP and Bayesian Approach 


In this work, the HOG angle of 5° was chosen as the best value for the histogram
descriptor distribution. It should be noted that the MLP was trained with 60
stator current signals. In the previous section, the MLP topologies were trained
to define the best parameters for rotor fault detection using HOG (the threshold
of the sigmoid neuron is 0.7, $n_{bins} = 37$ and $W_{cycle} = 30$). The input
layer is related to the number of bins $n_{bins}$; thus, the input of each MLP
topology was built with 37 bin angles. In this paper, a single hidden layer with
50 neurons was used for rotor fault detection.

Table 1 shows the results obtained for the four load conditions of the rotor
evaluation, after applying the MLP classifier. These results are the true
positives (TP), false negatives (FN), true negatives (TN), false positives (FP),
specificity (SP), sensitivity (SN) and accuracy for both the training and
validation datasets. In this case, the experiments numbered between 41 and 70
were used for validation purposes. In the last stage, the rotor fault detection
was performed using the time-window classification and the Bayesian classifier,
as mentioned in Sect. 3.
Table 1. Results for time-window classification after applying the MLP classifier.


Load condition             | TP   | FN  | TN   | FP  | SP   | SN   | Samples | Experiments | Acc
All loads (training data)  | 3128 | 72  | 3137 | 63  | 0.98 | 0.97 | 6400    | 320         | 0.98
All loads (validation data)| 2129 | 271 | 2100 | 300 | 0.87 | 0.88 | 4800    | 240         | 0.88
40 V (validation data)     | 513  | 87  | 528  | 72  | 0.88 | 0.85 | 1200    | 60          | 0.87
50 V (validation data)     | 537  | 63  | 529  | 71  | 0.88 | 0.90 | 1200    | 60          | 0.89
60 V (validation data)     | 556  | 44  | 505  | 95  | 0.84 | 0.93 | 1200    | 60          | 0.88
70 V (validation data)     | 591  | 9   | 561  | 39  | 0.93 | 0.98 | 1200    | 60          | 0.96
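The SP, SN and accuracy columns can be derived from the confusion counts. The sketch below uses the standard definitions of sensitivity, specificity and accuracy, which are assumed here since the text does not state the formulas explicitly, and applies them to the training row of Table 1; the function name `metrics` is hypothetical.

```python
def metrics(tp, fn, tn, fp):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # SN: faulty windows correctly flagged
    specificity = tn / (tn + fp)          # SP: healthy windows correctly passed
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Training row of Table 1: TP = 3128, FN = 72, TN = 3137, FP = 63.
sn, sp, acc = metrics(tp=3128, fn=72, tn=3137, fp=63)
```

For this row the computed values are about 0.978, 0.980 and 0.979, which agree with the tabulated 0.97, 0.98 and 0.98 to within the second decimal.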


To find the prior probabilities $P(y = 1)$ and $P(y = 0)$ we performed a ROC
analysis, and $P(y = 1) = 0.5$ was considered the best value for rotor condition
diagnosis. Table 2 shows the classification (i.e., either a faulty or a healthy
condition of the rotor structure) of the experiments after applying the Bayesian
classifier. For all load scenarios, i.e., feeding the braking system of the
induction motor with between 40 V and 70 V, the MLP and Bayesian classifier were
able to distinguish between a healthy rotor and a damaged structure (one broken
bar) with an accuracy of around 94%.


Table 2. Results for broken bars detection after MLP and Bayesian classification.


Load condition           | TP  | FN | TN  | FP | SP   | SN   | Samples | Experiments | Acc
All loads (40 V to 70 V) | 112 | 8  | 114 | 6  | 0.93 | 0.95 | 4800    | 240         | 0.94
40 V                     | 27  | 3  | 29  | 1  | 0.96 | 0.90 | 1200    | 60          | 0.93
50 V                     | 29  | 1  | 29  | 1  | 0.97 | 0.97 | 1200    | 60          | 0.97
60 V                     | 28  | 2  | 27  | 3  | 0.93 | 0.93 | 1200    | 60          | 0.91
70 V                     | 28  | 2  | 29  | 1  | 0.93 | 0.93 | 1200    | 60          | 0.95


5 Conclusions 


This paper proposes a new approach for the detection of broken rotor bars in
squirrel cage induction motors, using a histogram of oriented gradients (HOG).
The HOG bin angle variation was evaluated for a healthy motor and a damaged rotor
with one broken bar, using only the stator current as the measurement signal from
the electrical machine. The amplitude of each bin angle, after applying HOG to
each time-window, was used as input to a multilayer perceptron neural network to
detect fully broken rotor bars. For better failure classification, a Bayesian
classifier was applied to classify each experiment after the MLP evaluation of
its time-windows. The experimental results show good accuracy (around 94%) for
failure diagnosis, even when the IM was running at low load, and thus at very low
slip (close to 1%). Therefore, this time-domain approach, using HOG instead of
frequency-domain techniques, could be of interest for rotor failure detection in
the future. Further research is under way to better detect broken bars under
other load conditions and also to evaluate the fault severity (more broken bars).
Abstract. In this paper, Profit Sharing using a convolutional neural network is
realized. In the proposed method, the action value in Profit Sharing is learned
by a convolutional neural network. That is, the method learns the value function
of Profit Sharing instead of the value function of Q Learning used in the Deep
Q-Network. By changing to an error function based on the value function of Profit
Sharing, which can acquire a probabilistic policy in a shorter time, the proposed
method is able to learn in a shorter time than the conventional Deep Q-Network.
Computer experiments were carried out on Asterix for the Atari 2600, and the
proposed method was compared with the conventional Deep Q-Network. As a result,
we confirmed that the proposed method can learn from an earlier stage than the
Deep Q-Network and can finally obtain a higher score.


Keywords: Profit Sharing · Convolutional neural network


1 Introduction 


In recent years, deep learning has been drawing attention as a method that shows
better performance than conventional methods in the fields of image recognition
and speech recognition. Deep learning uses hierarchical neural networks with many
layers, and the Convolutional Neural Network (CNN) [1] is one of the
representative models.

On the other hand, various studies on reinforcement learning are being con- 
ducted as learning methods to acquire appropriate policies through interaction 
with the environment [2]. In reinforcement learning, learning can proceed by 
repeating trial and error even in an unknown environment by appropriately set- 
ting rewards. 

The Deep Q-Network [5] is based on the convolutional neural network, which is a
representative method of deep learning, and Q Learning [4], which is a
representative method of reinforcement learning. In the Deep Q-Network, when the
game screen (observation) is given as an input to the convolutional neural
network, the action value in Q Learning for each action is output. This method
can learn to achieve a score equal to or higher than that of a human in several
games. The combination of deep learning and reinforcement learning is called Deep
Reinforcement Learning, most of which is based on Q Learning.
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As a deep reinforcement learning using a method other than Q Learning, we 
have proposed a Deep Q-Network using reward distribution [6]. This method 
learns to not take wrong actions, by distributing negative rewards in the same 
way as Profit Sharing [3]. Although this method can perform learning with the 
same degree of precision and speed as Deep Q-Network, it shows that the score 
that can be finally obtained is same level as Deep Q-Network. 

In this paper, we propose Profit Sharing using a convolutional neural network.
In the proposed method, the action value in Profit Sharing is learned by a
convolutional neural network. That is, the method learns the value function of
Profit Sharing instead of the value function of Q Learning used in the Deep
Q-Network. By changing to an error function based on the value function of
Profit Sharing, which can acquire a probabilistic policy in a shorter time, the
proposed method is able to learn in a shorter time than the conventional Deep
Q-Network. Computer experiments were carried out on Asterix for the Atari
2600, and the proposed method was compared with the conventional Deep
Q-Network. As a result, we confirmed that the proposed method can learn from
an earlier stage than the Deep Q-Network and finally obtains a higher score.


2 Deep Q-Network 


Here, we explain the Deep Q-Network [5], which is the basis of the proposed
method. The Deep Q-Network is based on the convolutional neural network [1]
and Q Learning [4]. In the Deep Q-Network, when the game screen (observation)
is given as an input to the convolutional neural network, the action value
in Q Learning for each action is output. This method can learn to acquire a
score equal to or higher than that of a human in multiple games.


2.1 Structure 


The structure of the Deep Q-Network is shown in Fig. 1. As seen in Fig. 1, the
Deep Q-Network is a model based on the convolutional neural network, consisting
of three convolution layers and two fully connected layers. The play screen of
the game (observation) is input to the convolutional neural network, and the
action value for each action corresponding to the observation is output. For
the first to fourth layers, the rectified linear function is used as the output
function. The number of neurons in the last fully connected layer, which is the
output layer, is the same as the number of actions that can be taken in the
problem to be handled. Since the problem learned by the Deep Q-Network can be
regarded as a regression problem that learns the relationship between each
observation and the action value of each action in that observation, the output
function of the output layer is the identity function.


2.2 Learning


Since the action value in Q Learning is used as the output, the error
function used in learning is given by

$$E = \frac{1}{2}\left(r_\tau + \gamma \max_{a \in C^A(o_{\tau+1})} q(o_{\tau+1}, a) - q(o_\tau, a_\tau)\right)^2 \qquad (1)$$

where $r_\tau$ is the reward at the time $\tau$, $C^A(o_{\tau+1})$ is the set of
actions that an agent can take at the observation $o_{\tau+1}$, $\gamma$ is the
discount factor, and $q(o_\tau, a_\tau)$ is the value of taking action $a_\tau$
at observation $o_\tau$.
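
As a rough illustration of Eq. (1), the following minimal sketch computes the error for a single transition. It assumes the error is the usual halved squared temporal-difference error of Q Learning; the function name, argument names and the toy numbers are hypothetical and are not taken from the experiments.

```python
def dqn_error(q_current, reward, next_q_values, gamma=0.99):
    """Error of Eq. (1) for one transition (illustrative sketch).

    q_current     -- q(o_tau, a_tau), the network output for the action taken
    reward        -- r_tau
    next_q_values -- q(o_{tau+1}, a) for every action a available in o_{tau+1}
    gamma         -- discount factor
    """
    # Target: reward plus the discounted value of the best next action.
    target = reward + gamma * max(next_q_values)
    return 0.5 * (target - q_current) ** 2


# Toy numbers chosen only to make the arithmetic easy to follow.
error = dqn_error(q_current=1.0, reward=1.0, next_q_values=[0.5, 2.0, 1.0], gamma=0.5)
print(error)  # 0.5 * (1.0 + 0.5 * 2.0 - 1.0) ** 2 = 0.5
```

In an actual network the error would be back-propagated only through the output neuron of the action that was taken.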

When the game screen $o_\tau$ is given to the Deep Q-Network, the values of
all actions in observation $o_\tau$ are output in the output layer. Based on the
output action values, an action is determined by the ε-greedy method. In the
ε-greedy method, a random action is selected with probability ε (0 < ε < 1),
and the action whose value is highest is selected with probability 1 − ε.

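The probability $P(o_\tau, a)$ of selecting the action $a$ in observation
$o_\tau$ is given by

$$P(o_\tau, a) = \begin{cases} (1 - \varepsilon) + \dfrac{\varepsilon}{|C^A|} & \left(\text{if } a = \operatorname*{argmax}_{a' \in C^A} q(o_\tau, a')\right) \\ \dfrac{\varepsilon}{|C^A|} & (\text{otherwise}) \end{cases} \qquad (2)$$

where $|C^A|$ is the number of action types that the agent can take, which is the
same as the number of neurons in the output layer of the Deep Q-Network.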
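
The selection probabilities of the ε-greedy method can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments; the function names and the example action values are hypothetical.

```python
import random


def action_probabilities(q_values, epsilon):
    """Probability of each action under epsilon-greedy selection, as in Eq. (2)."""
    n = len(q_values)
    best = max(range(n), key=lambda a: q_values[a])
    probs = [epsilon / n] * n      # every action shares the random part
    probs[best] += 1.0 - epsilon   # the greedy action gets the remainder
    return probs


def select_action(q_values, epsilon, rng=random):
    """Sample one action: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


# Five hypothetical action values; action 1 has the highest value.
probs = action_probabilities([0.1, 0.7, 0.2, 0.0, 0.3], epsilon=0.1)
```

With ε = 0.1 and five actions, the greedy action is chosen with probability 0.9 + 0.1/5 and every other action with probability 0.1/5.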

The selected action $a_\tau$ is executed, and the state transits to the next state
$o_{\tau+1}$. Also, by taking the action $a_\tau$, the reward $r_\tau$ is given
based on the score, the game state and so on.

Merely approximating the action value of Q Learning with the convolutional
neural network makes learning unstable, so the learning of the Deep Q-Network
introduces several techniques: Experience Replay, a Fixed Target Q-Network and
Reward Clipping.


[Figure: Observation → Convolution Layer ×3 → Fully connected Layer ×2 → output (Q Learning)]

Fig. 1. Structure of Deep Q-Network.


3 Profit Sharing Using Convolutional Neural Network


Here, the proposed Profit Sharing using Convolutional Neural Network is 
explained. 


3.1 Outline 


In the proposed method, the action value in Profit Sharing is learned by a
convolutional neural network. That is, the method learns the value function of
Profit Sharing instead of the value function of Q Learning used in the Deep
Q-Network. By changing to an error function based on the value function of
Profit Sharing, which can acquire a probabilistic policy in a shorter time, the
proposed method is able to learn in a shorter time than the conventional Deep
Q-Network. However, in Profit Sharing, temporally continuous data within an
episode is meaningful, so the experience replay used in the Deep Q-Network is
not used in the proposed method. Q Learning uses a fixed target Q-Network
because the value of other rules is also used when updating the value of a
rule. In contrast, Profit Sharing uses only the values of the rules included in
the episode when updating the connection weights. Therefore, the proposed
method does not use a fixed target Q-Network.


3.2 Structure 


The structure of the convolutional neural network used in the proposed method
is shown in Fig. 2. As in the conventional Deep Q-Network, the convolutional
neural network used in the proposed method consists of three convolution
layers and two fully connected layers. The input to the convolutional neural
network is the play screen of the game. The output of the convolutional neural
network is the value of each action for that state.
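
Using the filter sizes and strides listed in Table 1, the output sizes of the three convolution layers can be checked with a short sketch. It assumes convolutions without padding, which is consistent with the sizes in Table 1 but is not stated explicitly; the helper name is hypothetical.

```python
def conv_output_size(input_size, filter_size, stride):
    """Spatial size after a convolution without padding (assumed)."""
    return (input_size - filter_size) // stride + 1


# (filter size, stride) of convolution layers 1 to 3, as listed in Table 1.
layers = [(8, 4), (4, 2), (3, 1)]

size = 84  # the input is 84 x 84 x 4
sizes = []
for filter_size, stride in layers:
    size = conv_output_size(size, filter_size, stride)
    sizes.append(size)

print(sizes)  # [20, 9, 7]
```

The last convolution layer therefore passes 7 × 7 × 64 = 3136 values to the first fully connected layer.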


[Figure: Observation → Convolution Layer ×3 → Fully connected Layer ×2 → output (Profit Sharing)]

Fig. 2. Structure of convolutional neural network used in proposed method.


3.3 Learning


In the proposed method, the convolutional neural network learns to output the
value of each action corresponding to the play screen of the game (observation)
which is given as input. Here, the action value is updated based on Profit
Sharing, so the error function E is given by

$$E = \left(r F(\tau) - q(o_\tau, a_\tau)\right)^2 \qquad (3)$$

where $r$ is the reward and $q(o_\tau, a_\tau)$ is the value of taking action
$a_\tau$ at observation $o_\tau$. $F(\tau)$ is the reinforcement function at the
time $\tau$ and is given by


$$F(\tau) = \frac{1}{\left(|C^A| + 1\right)^{W - \tau}} \qquad (4)$$

where $C^A$ is the set of actions that an agent can take at the observation,
$|C^A|$ is the number of actions that an agent can take, and $W$ is the length of
an episode.

The action is selected using the ε-greedy method, as in the conventional
Deep Q-Network.
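
The way a reward is distributed over an episode can be sketched as follows. This is only an illustration under stated assumptions: the reinforcement function is taken to be the geometrically decreasing form F(τ) = 1 / (|C^A| + 1)^(W − τ), the steps of an episode are indexed τ = 1, ..., W, and all function names are hypothetical. The five actions and the five-step episode are the values used in the experiments of Sect. 4.

```python
def reinforcement(tau, episode_length, num_actions):
    """Assumed reinforcement function F(tau) = 1 / (|C^A| + 1) ** (W - tau)."""
    return 1.0 / (num_actions + 1) ** (episode_length - tau)


def profit_sharing_targets(reward, episode_length, num_actions):
    """Targets r * F(tau) for tau = 1, ..., W (the last step receives r itself)."""
    return [reward * reinforcement(tau, episode_length, num_actions)
            for tau in range(1, episode_length + 1)]


def profit_sharing_error(q_value, reward, tau, episode_length, num_actions):
    """Squared difference between the target r * F(tau) and q(o_tau, a_tau)."""
    target = reward * reinforcement(tau, episode_length, num_actions)
    return (target - q_value) ** 2


# Five actions and an episode of five steps ending with a reward of 1.
targets = profit_sharing_targets(reward=1.0, episode_length=5, num_actions=5)
```

Under these assumptions the targets shrink by a factor of 6 for every step further back from the reward, which illustrates why long episodes leave almost no reward for their early steps.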


4 Computer Experiment Results 


To demonstrate the effectiveness of the proposed method, computer experiments 
were conducted on a game of Atari 2600 (Asterix). The results are shown below. 


4.1 Task 


Asterix is the action game shown in Fig. 3. The player can move their own
machine up, down, left and right. Jars and harps fly in from the left and right
of the screen. Taking a jar scores 50 points, while taking a harp reduces the
number of remaining machines. At the start of the game, there are three
machines. When no machines remain, the game ends. The score of the game is the
sum of the scores acquired by the end of the game.

The agent has five actions: moving up, down, left or right, and not moving.
The agent gets a positive reward (1) when it gains score. In addition, the
agent receives a negative reward (-1) when the number of remaining machines
decreases.


4.2 Experimental Conditions 


Table 1 shows the conditions for the convolutional neural network used in the
proposed method and the conventional Deep Q-Network. The game screen used
in this research is an RGB image of 400 × 500 pixels. In the experiment, the
RGB image is converted to grayscale and reduced to 84 × 84 pixels, and four
frames grouped together are used as the input.


Fig. 3. Asterix.


Table 2 shows the other conditions related to learning. An action is selected
by the ε-greedy method. At the start of learning, ε is set to 1 so that actions
are selected randomly. After that, ε is decreased at every action (one step) by
the decrement listed in Table 2 until it reaches its minimum value, so the
agent gradually places more emphasis on the action value when selecting an
action.
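
The schedule of ε described above can be sketched as a linear decrease that is clipped at the minimum value. The initial value 1 and the minimum 0.1 follow Table 2; the decrement is left as a parameter, and the value 0.25 used below is a toy number chosen only to keep the example short. The function name is hypothetical.

```python
def epsilon_at(step, decrement, eps_ini=1.0, eps_min=0.1):
    """Epsilon after `step` actions: linear decrease, never below eps_min."""
    return max(eps_min, eps_ini - decrement * step)


# Toy decrement of 0.25 per step (not the experimental value).
schedule = [epsilon_at(step, decrement=0.25) for step in range(6)]
print(schedule)  # [1.0, 0.75, 0.5, 0.25, 0.1, 0.1]
```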

In the proposed method, since Profit Sharing is used, as the episode becomes
longer, the denominator on the right side of Eq. (4) becomes too large and the
reward cannot be distributed sufficiently. Therefore, only the five steps
before the acquisition of a score are regarded as an episode.


4.3 Transition of Obtained Scores 


Here, a game of the Atari 2600 (Asterix) is learned by the proposed Profit
Sharing using a convolutional neural network, and the transition of the score
is compared with that of the conventional Deep Q-Network.

Figure 4 shows the transition of the obtained scores in each method. The
plotted values are scores averaged over every 50 thousand steps.


Table 1. Experimental conditions (1).

Layer                   | Filter size | Stride | Output size               | Output function
Input                   | -           | -      | 84 × 84 × 4               | -
Convolution layer 1     | 8 × 8       | 4      | 20 × 20 × 32              | ReLU
Convolution layer 2     | 4 × 4       | 2      | 9 × 9 × 64                | ReLU
Convolution layer 3     | 3 × 3       | 1      | 7 × 7 × 64                | ReLU
Fully connected layer 1 | -           | -      | 512                       | ReLU
Fully connected layer 2 | -           | -      | 5 (the number of actions) | Identity function


Table 2. Experimental conditions (2).

Condition                         | Symbol   | Value
The number of learning steps      | -        | 1.0 × 10^7
Initial value of ε                | ε_ini    | 1
Decrease amount of ε              | ε_r      | 1/10^6
Minimum of ε                      | ε_min    | 0.1
ε in evaluation episodes          | ε        | 0.05
Size of replay memory             | D_max    | 10^6
Size of mini batch                | M        | 32
Discount rate                     | γ        | 0.99
Update interval of target network | T_update | 10^4


[Figure: score versus the number of steps (0 to 10,000,000) for the proposed method and the conventional Deep Q-Network]

Fig. 4. Transition of obtained scores.


Asterix is considered a difficult problem for the conventional Deep Q-Network
to learn, and the score it acquires is not stable during the first 5 million
steps. After that, however, the acquired score rises, and the average score at
10 million steps is about 90 points. In the proposed method, the score
increases during the first 5 million steps, and after that a high score is
obtained stably. ε in the ε-greedy method reaches its minimum value (0.1) at
5 million steps. Considering that the score is stable in both methods after
5 million steps, the progress of learning might change if ε were decreased in a
different way. From the result of Fig. 4, we confirmed that the proposed method
can learn from an earlier stage than the conventional Deep Q-Network and that
the score it finally obtains is higher.


5 Conclusions 


In this paper, we have proposed Profit Sharing using a convolutional neural
network. In the proposed method, the action value in Profit Sharing is learned
by a convolutional neural network. That is, the method learns the value
function of Profit Sharing instead of the value function of Q Learning used in
the Deep Q-Network. By changing to an error function based on the value
function of Profit Sharing, which can acquire a probabilistic policy in a
shorter time, the proposed method is able to learn in a shorter time than the
conventional Deep Q-Network.

Computer experiments were carried out on Asterix for the Atari 2600, and the
proposed method was compared with the conventional Deep Q-Network. As a
result, we confirmed that the proposed method can learn from an earlier stage
than the Deep Q-Network and finally obtains a higher score.


Abstract. Fingerprint alteration is a challenge that poses enormous security
risks. As a result, many research efforts in the scientific community have
attempted to address this issue. However, the lack of publicly available
datasets that contain obfuscation and distortion of fingerprints makes it difficult
to identify the type of alteration. In this work we present the publicly available
Sokoto-Coventry Fingerprints Dataset (SOCOFing), which provides ten
fingerprints for 600 different subjects, as well as gender, hand and finger name for
each image, among other unique characteristics. We also provide a total of
55,249 images with three levels of alteration for Z-cut, obliteration and central
rotation synthetic alterations, which are the most common types of obfuscation
and distortion. In addition, this paper proposes a Convolutional Neural Network
(CNN) to identify these alterations. The proposed CNN model achieves a
classification accuracy rate of 98.55%. Results are also compared with a residual
CNN model pre-trained on ImageNet, which produces an accuracy of 99.88%.


Keywords: Central rotation · Convolutional neural networks · Distortion ·
Fingerprint alteration · Obfuscation · Obliteration · Z-cut


1 Introduction 


Forensic science is the use of applied science and technical approaches to
provide answers to issues in criminal, civil and administrative law. Fingerprints can be
altered through abrading [1], cutting [2], burning [3] and distortions, such as skin
grafting [4], where an unusual and unnatural change in the patterns of the friction ridges
occurs. The most common alteration types are Z-cut, central rotation and
obliteration. In this paper, we present a novel fingerprint dataset with unique attributes,
such as gender and finger type (index finger, thumb, ring finger, middle finger and
little finger) for both the left and right hand of each subject. Furthermore, we
present preliminary experimental results on the detection of the alteration type using a
deep CNN and a residual CNN model. The two presented models classify the
fingerprint images into Z-cut, central rotation, obliteration and real, i.e. non-altered
fingerprint. The real fingerprints from our SOCOFing dataset [5], with a total number
of 6000 fingerprints from 600 subjects, were synthetically altered to central rotation,
Z-cut and obliteration, which are the common types of alteration, resulting in a total of
55249 altered fingerprints. The SOCOFing dataset is publicly available for replication
and further experimental research, with the sole aim of improving the
security of fingerprint biometrics, so that criminals on the watch list can be
identified and apprehended even if their fingerprints have been altered.


2 Related Works 


Border control is one of the major beneficiaries of biometrics, where fingerprints are
used to detect and recognise individuals. People with past criminal records
and those who have committed high-profile crimes have been known to alter
their fingerprints to avoid detection, especially in refugee and asylum seeker camps [6].
Such mutilations involve either burning the fingers or using surgery to cut some part of
the fingers or body and place it onto another finger ('grafting'); some come in a
Z-shape, rotated centrally or obliterated, just to evade detection or to avoid linking the
individual with their past [6]. Fingerprints of a small proportion of visitors to foreign
countries are matched against a database of well-known criminals or terrorists.
Biometrics has helped in identifying and apprehending over 1600 individuals wanted for
felony crimes [7]. This is a sign that those wanting to hide their identity in pursuit of
their criminal motives may alter their fingerprints in order to cross a border and enter
a country without their true identity being detected. It is therefore essential to detect
such alteration types and link the altered fingerprint images to their original ones.
Furthermore, determining the alteration type is an essential first step towards revealing a
subject's identity.

Fingerprints can be obliterated or mutilated to systematically evade identification
by the biometric system [2]. Fingerprints can also be altered or grafted into various
patterns, shapes and sizes via surgical operation, which results in either a Z-cut or a
central rotation. Other types of alteration can be achieved by burning the fingerprints
('obliteration'), which in turn changes the fingerprint patterns that the biometric system
uses to match and identify individuals based on what was previously stored as the original
fingerprint [8]. Various software applications and hardware solutions have been proposed
[9, 10] to tackle this situation. However, the authors focus on spoofing and on distortion
by rotating fingers on the scanner. Obfuscation is the deliberate effort of an individual
to conceal their identity by altering the ridge patterns of their fingerprint [3]. Generally,
alterations are categorised into three fundamental classes according to the changes
made to the ridge patterns of the fingerprint: (i) obliteration or decimation, (ii) distortion
or bending and (iii) imitation or impersonation of a fingerprint [3]. The most common
alteration types based on the examination of ridge patterns presented by [3] are
obliteration and distortion, which make up 89% and 10% of such alterations,
respectively, whereas only 1% is reported as imitation. This shows that most alterations
are either obliteration or distortion, which we seek to address in this paper. In [3], the
proposed algorithm detects such fingerprint alterations with an accuracy of 66.4%.
The authors also emphasize the lack of publicly available databases that comprise
obliterated and distorted fingerprints and could be used in experiments to improve
alteration detection algorithms. The dataset used by the authors of [3] is not publicly
available, as it is highly secured due to the sensitivity of the data and is mostly owned
by law enforcement agencies. This makes it difficult for the research community to
offer better solutions and robust detection or matching algorithms that can detect
alterations with high accuracy.

The authors of [11] proposed various methods to generate synthetically altered
fingerprint images, which also include a variety of noise such as scars or blurring in
order to create more realistic fingerprints. The authors used this dataset to
develop a framework for detection or matching of altered fingerprints, where the
alterations are obliteration, central rotation and Z-cut. The authors of [2] focused on the
position of the alteration, which is often chosen at random, since the main objective is to
avoid being identified. This alteration can be achieved with a publicly available tool
proposed by [12]: the SynThetic fingeRprint AlteratioNs GEnerator (STRANGE).

Based on previous studies in the area of fingerprint alteration, analysis and
detection, a significant gap in knowledge was identified. In Yoon et al. (2012), a case
study compilation with automatic detection, classification and evaluation of altered
fingerprints is carried out with a view to reducing the number of individuals wanting to
evade identification. This study extends [3] by determining alteration types
automatically and by introducing a new fingerprint dataset comprising real and
altered fingerprints for experimental purposes and for replication of other academic
research on fingerprint alteration detection algorithms. The dataset also has
attributes that can open more research avenues, since it identifies gender,
finger name and left or right hand, which have received little or
no attention in the past. These form the contribution of the current research to addressing
alterations of fingerprints, using this specific fingerprint dataset in addition to
determining the alteration type.


3 Dataset 


The SOCOFing dataset comprises a total of 6,000 real fingerprints collected from 600
subjects, provided for experimental and other academic research purposes. We used
the STRANGE tool to alter fingerprints by applying Easy, Medium, and Hard settings
according to a quality threshold during fingerprint comparison [11]. The quality
threshold is determined by the image resolution, which by default is set to 500 dpi.
These categories are parameters that are tuned according to the performance drop
during fingerprint comparison. Furthermore, each category mentioned above is divided
into three types of synthetic alteration, i.e. obliteration, central rotation and Z-cut. Each
image has three types of alteration in each of the three categories; hence each image is
represented by nine altered images.

The dataset is divided into altered and real fingerprints. A total of 5977 real
fingerprints are altered using the easy parameter setting, 5689 real fingerprints are
altered using the medium setting and, finally, 4758 real fingerprints are altered with
the hard parameter setting. Each of the three parameter settings produced
three types of alteration: obliteration, central rotation and Z-cut. For instance, 5977 real
images produced 5977 obliterated fingerprints, 5977 central rotation and 5977 Z-cut
alterations. This means that for 5977 real fingerprints there are 17931 altered
fingerprints presented as fake in the easy category. Likewise, in the medium category a total
of 17067 fingerprints are presented as altered and, finally, 14274 fingerprints are altered in
the hard category. However, for the purpose of training and testing the convolutional
model, the alteration types of the fingerprint images are combined irrespective
of the settings. A total of 55249 fingerprint images were randomly divided into a 50%
training set and a 50% testing set. Note that the STRANGE tool did not find some
fingerprint images fit for alteration with specific parameters; hence the altered images
for each category are fewer than the total number of real images. Figure 1 below shows a
sample of real fingerprints from the left hand of one subject.


Fig. 1. Sample of real left hand of one subject. 


After applying the STRANGE tool for the three types of alterations, Fig. 2 below 
displays the altered fingerprint of the left hand of the same subject in Fig. 1. 


Fig. 2. Sample of altered left hand fingerprint into Z-cut, obliteration and central rotation, 
respectively, of the same subject. 


4 Methodology and Experimental Setup 


In this paper, we propose a deep CNN for feature extraction and classification. Deep
CNNs have proven to be efficient in image processing tasks and, therefore, are
suitable for detecting fingerprint alteration types. We train and evaluate this model on
the real and synthetically altered images of the SOCOFing dataset described above.
Each class, including real images, is randomly split into 50% training and 50% testing
subsets. The images are also resized to 200 × 200 using bipolar interpolation.


4.1 Convolutional Neural Network Model 


Convolutional neural networks retain spatial information through filter kernels. In this
work, we exploit this ability of CNNs to train a model to classify images from
the SOCOFing dataset into four categories: central rotation, obliteration, Z-cut and real,
where real images are those without any alteration.

The deep CNN model has five convolutional layers with 20 3 × 3, 40 3 × 3, 60
3 × 3, 80 3 × 3 filter kernels. All convolutional layers use a stride of one and zero
padding of size two. Moreover, the output of every convolutional layer is shaped by a
rectified linear unit (ReLU) function. Max pooling is applied to the first three
convolutional layers for dimensionality reduction. The convolutional layers are followed by
two fully connected layers with 1000 and 100 hidden units, respectively. Furthermore,
we employ batch normalization to standardize the distribution of each input feature
across all the layers and thus speed up training and avoid exploding gradients [13].

The deep CNN is trained using stochastic gradient descent (SGD) with a Nesterov
momentum of 0.5. We trained on mini-batches of size 70 and set the learning rate, LR,
to 0.01. LR was decayed with a factor of 0.01 according to:


$$LR = \frac{\lambda}{1 + \alpha\theta} \qquad (1)$$


where $\lambda$ denotes the initial LR, $\alpha$ the decay factor and $\theta$ the current
epoch. The loss is defined by a SoftMax operator and the cross-entropy $y$ is determined
according to:


$$y = -x_c + \log\left(\sum_j \exp(x_j)\right) \qquad (2)$$


where $c$ is the class ground-truth. Training was done for 100 epochs, as further training
led to overfitting.
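
The SoftMax cross-entropy loss can be sketched for a single sample as y = −x_c + log Σ_j exp(x_j), where x_j is the network output for class j. This is a minimal illustration and not the training code used here; subtracting the maximum score before exponentiation is an added numerical-stability step, and the function name and example scores are hypothetical.

```python
import math


def softmax_cross_entropy(scores, true_class):
    """Loss y = -x_c + log(sum_j exp(x_j)) for one sample."""
    # Subtract the maximum score so that exp() cannot overflow;
    # this does not change the value of the loss.
    m = max(scores)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scores))
    return -scores[true_class] + log_sum_exp


# Four classes (central rotation, obliteration, real, Z-cut) with equal scores:
# the loss equals log(4), the value of an uninformed prediction.
loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], true_class=2)
```

The loss approaches zero as the score of the true class grows relative to the other scores.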


4.2 Residual Convolutional Neural Network Model 


Residual Neural Networks (ResNets) have proven to be exceptionally effective
models for image classification [14]. ResNets have identity shortcut connections that
allow very deep architectures to be trained and, therefore, more complex features
to be learned, leading to improved classification performance. For this reason we
decided to compare our model with a ResNet18, that is, with 18 parametrized
convolutional layers, provided by [15, 16].

This network was originally trained and evaluated on ImageNet [17]. The authors
also provide deeper architectures, of up to 200 layers, pre-trained on the same dataset.
However, because fingerprint images have a relatively small number of features and
the problem being addressed here is not as complex as classifying
ImageNet, which has 1000 classes, we did not consider deeper architectures.

The ResNet18 model is fine-tuned on the training subset of the SOCOFing dataset
presented in this paper for only 5 epochs. No modifications were made to the network other
than the replacement of the output layer to predict only four classes. Training was also
done using SGD, with a Nesterov momentum of 0.75 and a learning rate of 0.001. This
ResNet model is then evaluated on the test subset.


5 Results and Discussion 


The confusion matrices below show the number of images of each alteration type detected
and also the number of fingerprint images misclassified. The results are presented in
Tables 1 and 2 for the three types of alteration and the real fingerprint images, together
with the percentage accuracy of the detection of each type.


Table 1. Confusion matrix of our CNN.

                 | Central rotation | Obliteration | Real  | Z-cut | Accuracy (%)
Central rotation | 7995             | 33           | 0     | 183   | 97.37
Obliteration     | 19               | 8148         | 0     | 44    | 99.23
Real             | 0                | 0            | 2988  | 0     | 100
Z-cut            | 116              | 6            | 0     | 8089  | 98.51
Recall (%)       | 98.34            | 99.52        | 100   | 97.27 | 98.55


Table 2. Confusion matrix of the pre-trained and fine-tuned ResNet18.

                 | Central rotation | Obliteration | Real  | Z-cut | Accuracy (%)
Central rotation | 8206             | 1            | 1     | 3     | 99.94
Obliteration     | 0                | 8195         | 15    | 1     | 99.81
Real             | 0                | 0            | 2986  | 2     | 99.93
Z-cut            | 4                | 0            | 11    | 8196  | 99.82
Recall (%)       | 99.95            | 99.98        | 99.10 | 99.93 | 99.86
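
The percentages in the last column and last row of Table 1 can be recomputed from the counts with a short sketch. The rows and columns are taken in the class order central rotation, obliteration, real, Z-cut; the variable and function names are hypothetical.

```python
# Counts of Table 1 (our CNN), rows and columns in the order
# central rotation, obliteration, real, Z-cut.
MATRIX = [
    [7995, 33, 0, 183],
    [19, 8148, 0, 44],
    [0, 0, 2988, 0],
    [116, 6, 0, 8089],
]


def percentages(matrix):
    """Row-wise, column-wise and overall percentage of diagonal entries."""
    n = len(matrix)
    row_pct = [100.0 * matrix[i][i] / sum(matrix[i]) for i in range(n)]
    col_pct = [100.0 * matrix[j][j] / sum(row[j] for row in matrix) for j in range(n)]
    overall = 100.0 * sum(matrix[i][i] for i in range(n)) / sum(map(sum, matrix))
    return row_pct, col_pct, overall


row_pct, col_pct, overall = percentages(MATRIX)
print([round(v, 2) for v in row_pct])  # [97.37, 99.23, 100.0, 98.51]
print([round(v, 2) for v in col_pct])  # [98.34, 99.52, 100.0, 97.27]
print(round(overall, 2))               # 98.55
```

The same function can be applied to the counts of Table 2.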


As indicated in Table 1, 2988 real fingerprint images are correctly classified
as real fingerprints; that is, the proposed model classified 100% of
the real fingerprints correctly. Across all four classes, 98.55% of the overall
predictions are correct. In addition, 183 central rotation images
are mixed up with Z-cut and 116 Z-cut images are mixed
up with central rotation. This can be explained by the fact that some of the angles in the
parameter setting of the tool rotate the altered part of the image in a pattern
similar to the ridge patterns, namely radial and ulnar loops. A radial loop is a loop that
comes from the thumb side and loops out to the pinky side of the hand, while
an ulnar loop is the opposite, i.e., from the pinky side of the hand towards the thumb
[18]. These rotation angles contributed to the misclassification
between central rotation and Z-cut, which results in as many
as 183 and 116 altered fingerprint images being labelled as Z-cut and central rotation,
respectively.

Table 2 shows the confusion matrix for the pre-trained ResNet-18 model, which
achieves a global accuracy of 99.86%. It misclassifies two real fingerprint images as
Z-cut, while the proposed CNN model classifies all the real fingerprint images correctly.
Furthermore, 15 of the obliterated fingerprint images are misclassified as real, and 11
Z-cut altered fingerprints are also misclassified as real. This may be because some of the
real images are not of good quality and appear as obliteration. Moreover, some loop
ridges in a fingerprint, when rotated by certain angles, might produce
pattern changes that look like a Z-cut shape and hence be classified as Z-cut. In
addition, there are natural cuts in some of the fingerprints, which the models
also detect as a Z-cut, as shown in Fig. 3 (central rotation classified as Z-cut). Some
fingerprints also appear blurred and hazy, which the model classified as
obliteration, as indicated in Fig. 4, where central rotation is misclassified as obliteration.
Figure 5 shows altered Z-cut fingerprints classified as obliteration because of the
blurring defect of the real fingerprint at the top of the images. As some of the
images are from female fingers, we also cannot rule out the possibility of the subjects
wearing henna, as shown in the last image of Fig. 5.

Evaluating the confusion matrices above, we found that the accuracy rate for central
rotation is 97.37% for our CNN and 99.94% for the pre-trained model. This shows that the
pre-trained model performs better at detecting images with the central rotation
alteration type. Likewise, it also does better in recall, with 99.95% against 98.34%.
The pre-trained ResNet-18 model performs better in almost all the categories.
However, even though its detection accuracy is high on real images, with a precision of
99.93% and recall of 99.10%, the CNN model we proposed does better, with 100%
for both precision and recall.

The two CNN models achieved high accuracy in the classification of altered
fingerprints. Nevertheless, some images are still misclassified, particularly the altered
fingerprint images.


Fig. 4. Central rotation misclassified as obliteration.


Fig. 5. Z-cut misclassified as obliteration.


From the misclassified fingerprints illustrated in Figs. 3 and 4, we can see that
fingerprints in the easy alteration category are misclassified most often by the CNN
model, because only a small proportion of the fingerprint is visibly altered, followed
by the medium category. Fingerprints in the hard category are misclassified less often,
except where the rotation angle of the pattern causes central rotation to be confused
with Z-cut.

Selvarani et al. [19] use singular points to distinguish between real fingerprints and
altered ones, by extracting sets of features from the ridge orientation field of an input
fingerprint and then applying a fuzzy classifier to classify it as real or altered (‘Z-cut’).
Similarly, [20, 21] introduced a classifier that detects altered fingerprint images with
Z-cut and central rotation only, using extracted features and a support vector classifier.
This was tested using synthetic fingerprints and achieved 92% accuracy, well above the
well-known fingerprint quality software NFIQ, which recognised only 20% of the altered
fingerprints. We therefore cannot provide a comparison on other alterations, since, to
the best of our knowledge, no prior work has been done on detecting these three types
of alterations together.

One of the main advantages of the deep CNN proposed in this work is that it requires no
pre-training: the ResNet18 was pre-trained on the ImageNet dataset, which has over one
million images spanning 1,000 classes, whereas our model was trained only on our
dataset and for only 100 epochs. Our model also has a significantly smaller number of
convolutional layers, and thus considerably fewer trainable parameters.
Moreover, because the CNN proposed here has a precision and recall rate of 100% on
real images, it can be more suitable for use in applications where detecting whether a
fingerprint has been altered or not is most important. Furthermore, the performance of
the ResNet models provided by [15] relies heavily on the image pre-processing steps,
such as aspect ratio resizing and luminance adjustments.


6 Conclusion 


Detecting and identifying altered fingerprints is still an issue that requires more
attention. In this paper, we have introduced a novel fingerprint
dataset, SOCOFing, for wider research accessibility. We highlighted the
importance of fingerprint alteration research and the need for automatic digital
detection of altered fingerprints. We also discussed the most common types of obfuscation
and distortion: central rotation, obliteration and Z-cut. The presented dataset includes
three different levels of alteration for each one of these types. Furthermore, the
dataset presented in this paper has a number of unique attributes, such as the name of
each finger, the hand to which it belongs, and the gender of the
fingerprint owner. We have also proposed a CNN model that is able to detect not only
whether a fingerprint has been altered but also the type of alteration. The
proposed CNN achieved an accuracy rate of 98.55% on the testing subset of the
SOCOFing dataset. This was compared against a ResNet18 model pre-trained on
ImageNet and fine-tuned and tested on our dataset, which achieved a state-of-the-art accuracy
rate of 99.86%. One of the main differences in performance between our model and the
ResNet18 model was that, even though the ResNet18 slightly outperformed our model overall,
our model achieved a precision and recall rate of 100% on real images; thus, it can be
more suitable for real-time applications.

To the best of the authors’ knowledge, no prior work has addressed these three
types of alterations together. However, one of the limitations of this work is that the proposed
CNN was evaluated on synthetically altered images due to the lack of publicly
available datasets containing actual altered images. Nonetheless, we hope that the
results presented in this work can serve as a benchmark in identifying fingerprint
alterations and that the presented dataset can assist the research community in
developing more robust biometric fingerprint technology for the automatic detection of
altered fingerprints.

Future work will also investigate the reasons why the ResNet18 model confuses 
non-altered fingerprints with altered ones. Moreover, we will also test our model on 
different datasets, with different alteration types, to see if it retains 100% precision and 
recall rates on real images.


Abstract. In recent years, Location-Based Service (LBS) providers have relied
increasingly on predictive models in order to offer their users timely
and tailored solutions. Current location prediction algorithms go beyond
using plain location data and show that additional context information
can lead to higher performance. Moreover, it has been shown that using
semantics and projecting GPS trajectories onto so-called semantic trajectories
can further improve the model. At the same time, Artificial Neural
Networks (ANNs) have proven to be very reliable when it comes
to modeling and predicting time series. Recurrent network architectures
show particularly good performance. However, very little research has
been done on the use of Convolutional Neural Networks (CNNs) in
connection with modeling human movement patterns. In this work, we
introduce a CNN-based approach for representing semantic trajectories
and predicting future locations. Furthermore, we include an additional
embedding layer to raise the efficiency. In order to evaluate our approach,
we use the MIT Reality Mining dataset and compare it against a Feed-Forward
(FFNN), a Recurrent (RNN) and an LSTM network
on two different semantic trajectory levels. We show that CNNs
are more than capable of handling semantic trajectories, while providing
high prediction accuracies at the same time.


Keywords: Convolutional Neural Networks · Semantic trajectories ·
Location prediction · Embedding layer


1 Introduction 


With the rise in the use of smartphones, wearables and other IoT devices over
the past decade, applications that use location data have become increasingly
popular. In addition, in recent years, providers have increasingly attempted to predict
the locations to be visited next by their users, in order to be able to offer them
timely and personalised services. This makes location prediction research
particularly important. Patterns mined from location data can provide deep
insight into the behaviour of mobile users. The use of semantic knowledge
helps to dive even deeper into their behaviour. So-called semantic trajectories
encapsulate additional knowledge that can be crucial for the predictive model.

The purpose of our paper is to present and evaluate a Convolutional Neural
Network (CNN) architecture in a semantic location prediction scenario. First, we
describe some related work that has been done in the realms of semantic location
prediction, semantic location mining and CNNs. Next, we elaborate on the way
CNNs work, providing some relevant term definitions along the way. In
Sect. 4 we outline our own architecture together with some basic implementation
details. Finally, in Sects. 5 and 6, we discuss our evaluation outcome and draw
our final conclusions with regard to our findings.


2 Related Work 


Spaccapietra et al. [12] were among the first to highlight the importance of
viewing trajectories of moving objects in a conceptual manner. They show that,
by defining and adding semantic information, such as the notion of application-specific
stops and moves, to the raw trajectories, they can significantly enhance
the analysis of movement patterns and provide further insights into object
behaviour. Elragal et al. [5] depict the benefits of integrating semantics into
trajectories as well. It is shown that semantic trajectories help improve both pattern
extraction and decision-making processes in contrast to raw trajectories. For
this reason, several papers have emerged in recent years presenting approaches
to transforming raw location data into so-called semantic locations (Sect. 3.1).
Alvares et al., for instance, introduce a semantic enrichment model aiming at simplifying
the query and analysis of moving objects [1]. Bogorny et al. [2] extend
the previous approach by introducing a more general and sophisticated model,
capable of handling more complex queries, while providing different semantic
granularities at the same time.

The notion of semantic trajectories has also grown in importance in the field
of location prediction in recent years. Ying et al. [13], for example, present a
location prediction framework based on semantic trajectories previously mined
from the users’ raw geo-tracking data. Their prefix tree decision-based algorithm
shows good performance, especially in terms of recall, F-score and efficiency.

In their recent work, Karatzoglou et al. [7] explore the modeling and prediction
performance of various artificial neural network (ANN) architectures,
e.g., Feed-Forward (FFNN), Recurrent (RNN) and Long Short-Term Memory
(LSTM) networks, on semantic trajectories. Similar to Ying et al., they evaluate
their models using the MIT Reality Mining dataset [4], with the LSTM achieving
the best results with up to 76% in terms of accuracy and outscoring the other
methods on F-score and recall as well. In addition, they investigate the role of
the semantic granularity of the considered trajectories in the overall performance
of the networks. They show that the higher the semantic level, the better the
modeling quality of the networks.

Lv et al. explore in [10] the possibility of using Convolutional Neural Networks
(CNNs) (Sect. 3.2) to predict taxi trajectories. Their approach projects
past trajectories onto a map and models them in turn as 2D images, to which
the CNN is finally applied to estimate future trajectories. By modeling
trajectories as 2D images, they are able to make use of the inherent advantage of
CNNs, namely their good performance in image analysis. This is also confirmed
by their results. However, their approach is applied to raw, non-semantic GPS
trajectories.

To our knowledge, there is no work exploring the performance of CNNs on
semantic trajectories. Moreover, it seems that there is no work using trajectories
(semantic or non-semantic ones) in combination with CNNs directly, i.e.,
without transforming them in an intermediate step into 2D images, but handling
them in their raw form instead, as 1D vectors. In the presented work, we examine
exactly these two points in terms of prediction performance in a semantic
location prediction scenario. For this purpose, we focused on the Natural Language
Processing (NLP) use case where, similar to our case, the data are also 1D
and some work has already been done in combination with CNNs. Particularly
interesting is the work of Collobert et al. [3], who propose a CNN architecture for
solving several NLP problems including named entity recognition and semantic
role labelling. Their framework features an unsupervised training algorithm for
learning internal representations, e.g., by using an embedding layer and learning
low-dimensional feature vectors of given words through backpropagation,
yielding good performance both in terms of accuracy and speed. The benefit
of using embeddings has recently also been shown in connection with modeling
human trajectories by Gao et al. in [6].


3 Theoretical Background 


In this section, we give a brief insight into the fundamental components of our 
work. 


3.1 Semantic Trajectories 


Movement patterns, so-called trajectories, describe sequences of consecutive location
points visited by some object or person. In ubiquitous and mobile computing,
trajectories usually refer to GPS sequences like the one displayed in
Eq. 1, where $long_i$, $lat_i$ and $t_i$ refer to longitude, latitude and point in time,
respectively.


$(long_1, lat_1, t_1), (long_2, lat_2, t_2), \ldots, (long_i, lat_i, t_i)$    (1)


In an attempt to add more meaning when modeling movement, researchers
like Spaccapietra et al. [12] and Alvares et al. [1] went beyond such numerical
sequences and focused on conceptual, semantically enriched trajectories,


64 A. Karatzoglou et al. 


so-called semantic trajectories. A semantic trajectory is defined as a sequence
of semantically significant locations (semantic locations, e.g., “home”, “burger
joint”, etc.) as follows:


$(SemLoc_1, t_1), (SemLoc_2, t_2), \ldots, (SemLoc_i, t_i)$    (2)


A significant location usually refers to a location at which a user stays longer
than a certain amount of time, e.g. 20 min. Some researchers add further thresholds,
like popularity, in order to extract the most significant common or public
locations (see [13]). Locations can be described hierarchically over a number of
semantic levels, e.g., “restaurant” → “fast food restaurant” → “burger
joint”. In this work, we evaluate the modeling performance of CNNs on two
different semantic levels.
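
The stay-time criterion described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the input format (label, arrival minute, departure minute), the helper name `to_semantic_trajectory` and the example visits are hypothetical; only the 20 min example threshold is taken from the text.

```python
MIN_STAY_MINUTES = 20  # example threshold mentioned in the text


def to_semantic_trajectory(visits, min_stay=MIN_STAY_MINUTES):
    """Keep only visits that last longer than min_stay minutes.

    visits: list of (semantic_location, arrival_minute, departure_minute).
    Returns a list of (semantic_location, arrival_minute) pairs, i.e. the
    (SemLoc, t) form of a semantic trajectory.
    """
    trajectory = []
    for location, arrival, departure in visits:
        if departure - arrival > min_stay:
            trajectory.append((location, arrival))
    return trajectory


# Hypothetical day of one user, times in minutes since midnight.
visits = [
    ("home", 0, 480),
    ("bus stop", 485, 490),        # 5 min stay: not significant, dropped
    ("work", 510, 720),
    ("burger joint", 725, 760),
]
semantic_trajectory = to_semantic_trajectory(visits)
```

Additional thresholds such as popularity would be further filters applied in the same loop.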


3.2 Convolutional Neural Networks (CNNs) 


The most popular application area of Convolutional Neural Networks (CNNs)
is image classification and recognition [9]. However, CNNs can be applied
to other areas as well, such as speech recognition and time series [8]. An example
CNN architecture for the image classification use case can be seen in Fig. 1.


[Figure: INPUT → CONVOLUTION + RELU → POOLING → CONVOLUTION + RELU → POOLING (feature learning) → FLATTEN → FULLY CONNECTED → SOFTMAX (classification); example output classes: CAR, TRUCK, VAN, ..., BICYCLE]


Fig. 1. Typical CNN architecture for Image Classification (source: [11]).


Here, the CNN first receives an image, which it is supposed to classify, as its
input. Next, a set of convolution operations takes place in order to extract
features. These operations are realised by filter kernels of fixed size,
containing learnable weights, which are slid over the input image to “search”
for certain features. The output of each convolution filter forms a new layer that
contains the findings of that filter in the input image. These layers are then further
processed by a set of pooling operations. Pooling operations combine multiple
outputs from filter kernels in a feature layer into a single value (e.g. by taking
the average or maximum value of the outputs in question). The resulting pooled


Convolutional Neural Networks for Modeling Semantic Trajectories 65 


layers can then be further processed, as shown here, by more Convolution +
Pooling operations, so that higher-level features can be extracted. The
last pooled layer is flattened, i.e. transformed into a single long vector containing
all of its values. These are then passed to a fully connected layer, which is
in turn connected to the output of the network, in this case a Softmax
layer that contains one entry for every classifiable object and thus represents
the classification estimate of the network for the given input.
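
The pipeline described above (convolution, non-linearity, pooling, Softmax) can be illustrated on a one-dimensional toy signal. The sketch below uses only the Python standard library; the signal and the hand-set kernel are assumptions made for illustration, whereas a real CNN learns its kernel weights. As in most deep learning libraries, the "convolution" is implemented as a sliding dot product without flipping the kernel.

```python
import math


def conv1d_valid(signal, kernel):
    """Slide the kernel over the signal (no padding) and return the responses."""
    width = len(kernel)
    return [
        sum(signal[start + offset] * kernel[offset] for offset in range(width))
        for start in range(len(signal) - width + 1)
    ]


def relu(values):
    """Set negative responses to zero."""
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    """Combine each non-overlapping window of `size` outputs into its maximum."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def softmax(scores):
    """Turn raw scores into probabilities that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


signal = [0.0, 1.0, 3.0, 1.0, 0.0, 2.0, 4.0]
edge_kernel = [-1.0, 1.0]  # hypothetical hand-set weights; a CNN would learn these

feature_map = relu(conv1d_valid(signal, edge_kernel))
pooled = max_pool(feature_map, 2)
probabilities = softmax(pooled)
```

Here the kernel responds to rising values in the signal, pooling keeps the strongest response in each window, and the Softmax turns the pooled values into a probability distribution.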


4 CNNs for Semantic Trajectories - Our Approach 


As already mentioned, our network (CNN) takes semantic trajectories as input,
like the ones defined in Sect. 3.1. For this purpose, each semantic location is
given a unique index. After being fed into the CNN, each index value in the trajectory
is passed to a hash table (embedding layer), which assigns each index,
and thus each semantic location, a k-dimensional feature vector (embedding),
where k is a hyperparameter set by us (Sect. 5). At the very beginning,
the feature vectors in the lookup table are randomly initialized. These vectors
are then trained on the available training data via backpropagation in order to
become optimal task-specific representations. In tangible terms, for our case, this
means that we give our model the freedom to find the optimal semantic location
representation by itself. The resulting representations are used as input for
our core model. A similar idea was proposed by Collobert et al. in [3] to learn
feature vectors that represent words in a text corpus for solving NLP problems.
After the hash table operation, our semantic location set, initially represented
by an n x 1 vector, becomes an n x k matrix. This can be seen on the left in Fig. 2
and as self.embedded_locs_expanded in Listing 1.1.
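
The lookup step can be sketched as follows. This is a plain-Python illustration, not the TensorFlow code used in this work: the vocabulary, the helper names and the value k = 4 are assumptions, and the training of the vectors via backpropagation is not shown.

```python
import random


def build_embedding_table(num_locations, k, seed=0):
    """Randomly initialised lookup table: one k-dimensional vector per location index."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(k)] for _ in range(num_locations)]


def embed(trajectory_indices, table):
    """Map a trajectory of n indices to an n x k matrix (a list of n rows)."""
    return [table[index] for index in trajectory_indices]


location_index = {"home": 0, "work": 1, "restaurant": 2}  # hypothetical vocabulary
trajectory = ["home", "work", "restaurant", "work", "home"]
indices = [location_index[name] for name in trajectory]

table = build_embedding_table(num_locations=len(location_index), k=4)
embedded = embed(indices, table)  # 5 x 4 matrix for this trajectory
```

Every occurrence of the same semantic location maps to the same row of the table, so an update to that row during training affects all trajectories containing the location.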


[Figure: input trajectory (Home, Work, Restaurant, Work, Home) → n x k representation of trajectory → Convolutional Layer → Fully Connected Layer; output labels shown: Friends Home, Bar, Club, Home]


Fig. 2. An abstracted view on the core layers of our CNN.


In the next step, a set of convolutional filters is applied to the resulting
matrix. These filters span the entire feature vector dimension and
multiple locations of the trajectory, as can be seen in Fig. 2. The number of
filters is a hyperparameter that can also be set by the user. Like the size k of
the embedding dimension described above, it can affect the performance of the
prediction.

The outputs of the filters are then concatenated and flattened (self.h_pool_flat
in Listing 1.1) to make up a fully connected layer, linked to a Softmax output
layer, which provides the final prediction of the next semantic location to be
visited by a user. We decided against using pooling layers on the filter outputs,
since this led to the loss of significant feature information (e.g., locations in the
latter part of the trajectory being more important for location prediction than the
older ones).
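
To make the shapes concrete, the following plain-Python sketch applies filters that cover the full embedding dimension to an n x k matrix and flattens the outputs without pooling. The sizes, the constant filter weights and the helper names are assumptions for illustration, and the order of values in the flattened vector is an arbitrary choice that may differ from the order produced by a TensorFlow reshape.

```python
def apply_filters(embedded, filters):
    """Valid convolution of an n x k matrix with filters of shape filter_size x k.

    Returns one feature list per filter, each of length n - filter_size + 1.
    """
    n = len(embedded)
    outputs = []
    for weights in filters:
        filter_size = len(weights)
        feature = []
        for start in range(n - filter_size + 1):
            total = 0.0
            for row_offset in range(filter_size):
                for col, weight in enumerate(weights[row_offset]):
                    total += weight * embedded[start + row_offset][col]
            feature.append(total)
        outputs.append(feature)
    return outputs


def flatten(features):
    """Concatenate all filter outputs into one long vector (no pooling)."""
    return [value for feature in features for value in feature]


# Illustrative sizes only.
sequence_length, embedding_size, filter_size, num_filters = 5, 3, 2, 4
embedded = [[float(row + col) for col in range(embedding_size)]
            for row in range(sequence_length)]
filters = [[[1.0] * embedding_size for _ in range(filter_size)]
           for _ in range(num_filters)]

flat = flatten(apply_filters(embedded, filters))
filter_outputs_total = num_filters * ((sequence_length - filter_size) + 1)
```

Because every position of every filter is kept, the flattened vector preserves where in the trajectory a pattern was found, which is the information a pooling layer would discard.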

In order to train our model, we used backpropagation with the Adam optimizer.
The Adam optimizer maintains an individual learning rate for each network
weight and adapts them separately. This is especially effective since our
data is quite sparse compared to other more typical problems addressed by CNNs
such as image recognition. We used Python and the TensorFlow¹ library to implement
our model. To prevent overfitting, dropout is applied to the flattened vector,
as shown in the last statement of Listing 1.1.
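
The per-weight adaptation mentioned above can be illustrated with a single Adam update written out by hand. This is a sketch of the standard Adam rule with its commonly used default constants, not the TensorFlow implementation used in this work; the function name `adam_step`, the state dictionary and the gradient values are assumptions for illustration.

```python
import math


def adam_step(weights, grads, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place and return the updated weights."""
    state["t"] += 1
    t = state["t"]
    for i, grad in enumerate(grads):
        # Running averages of the gradient and of the squared gradient.
        state["m"][i] = beta1 * state["m"][i] + (1 - beta1) * grad
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * grad * grad
        # Bias correction for the zero-initialised averages.
        m_hat = state["m"][i] / (1 - beta1 ** t)
        v_hat = state["v"][i] / (1 - beta2 ** t)
        # Each weight gets its own effective step size.
        weights[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return weights


weights = [1.0, 1.0]
state = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0]}
adam_step(weights, grads=[100.0, 0.01], state=state)
```

Although the two made-up gradients differ by four orders of magnitude, both weights move by roughly the learning rate in the first step, which is why rarely updated weights in sparse data are not starved of progress.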


Listing 1.1. Convolution output and flattened layer. 


# Fragment of the model class; assumes TensorFlow 1.x imported as tf.

# Convolution layer: each filter spans filter_size consecutive locations
# and the full embedding dimension.
self.conv1 = tf.layers.conv2d(
    inputs=self.embedded_locs_expanded,
    filters=num_filters,
    kernel_size=[filter_size, embedding_size],
    padding="VALID",
    name="conv1")

# Combine all the features
filter_outputs_total = num_filters * ((sequence_length - filter_size) + 1)
self.h_pool_flat = tf.reshape(self.conv1, [-1, filter_outputs_total])

# Add dropout
self.h_drop = tf.nn.dropout(self.h_pool_flat, self.dropout_keep_prob)


Listing 1.2 illustrates the implementation of the fully connected layer. W
and b represent the weights and the offset (bias), respectively. Furthermore, we used
TensorFlow’s nn.softmax_cross_entropy_with_logits and reduce_mean functions to
calculate the loss. The calculated loss is used by the Adam optimizer to adjust
the weights of the TensorFlow graph, and thus to complete a single training
step.


1 https://www.tensorflow.org.


Listing 1.2. Fully connected layer and loss calculation.


# Final (unnormalized) scores and predictions
W = tf.get_variable(
    "W",
    shape=[filter_outputs_total, num_classes],
    initializer=tf.contrib.layers.xavier_initializer())
b = tf.Variable(tf.constant(0.1, shape=[num_classes]), name="b")
self.scores = tf.nn.xw_plus_b(self.h_drop, W, b, name="scores")
self.predictions = tf.argmax(self.scores, 1, name="predictions")

# Calculate mean cross-entropy loss
losses = tf.nn.softmax_cross_entropy_with_logits(
    logits=self.scores, labels=self.input_y)
self.loss = tf.reduce_mean(losses)
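

What the loss in Listing 1.2 computes can be written out in plain Python. The sketch below is an illustration of mean softmax cross-entropy and arg-max prediction, assuming one-hot labels; the helper names and the example scores are hypothetical and the code is not part of the TensorFlow model.

```python
import math


def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label."""
    peak = max(logits)
    # Log of the softmax normaliser, computed in a numerically stable way.
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_norm for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))


def mean_loss(batch_logits, batch_labels):
    """Average the per-sample losses over the batch."""
    losses = [softmax_cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


def predict(logits):
    """Index of the highest score (the arg-max prediction)."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Made-up scores for a batch of two samples and three candidate locations.
batch_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
batch_labels = [[1, 0, 0], [0, 1, 0]]

loss = mean_loss(batch_logits, batch_labels)
predictions = [predict(z) for z in batch_logits]
```

The first sample is scored confidently and correctly and contributes a small loss; the second has uniform scores and contributes log(3), the loss of a uniform guess over three classes.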


5 Evaluation 


In order to evaluate our approach, we used the MIT Reality Mining dataset [4],
which contains the semantically enriched tracking data of approximately 100
users over a period of 9 months. Filtering out the inconsistencies and keeping
the most consistent annotators left us with the two-semantic-level evaluation
dataset of 26 users from [7]. Figure 3 illustrates the overall location distribution.
We then extracted trajectories of a fixed length and considered the subsequent
location to be the ground truth prediction label (see Fig. 4). We shuffled the
resulting (trajectory, label) pairs and took 90% of them for training and 10%
for testing. We trained and evaluated both separate single-user models and
a multi-user model that contained the trajectories of all users. In the
case of the multi-user model, a single trajectory composed of all the available
single-user trajectories was fed into the model as if it came from a single user.


Fig. 3. Distribution of high-level semantic locations.


[Figure: the location sequence Home, Work, Restaurant, Work, Home, with Trajectory 1 / Label 1 and Trajectory 2 / Label 2 marked as successive windows]


Fig. 4. Data Extraction exemplified with a trajectory length of 3.
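

The extraction of (trajectory, label) pairs and the 90/10 split can be sketched as follows. This is an illustration under stated assumptions: the window is assumed to advance by one location at a time, as Fig. 4 suggests, and the helper names, the random seed and the repetition of the toy pairs are hypothetical.

```python
import random


def extract_pairs(locations, trajectory_length):
    """Slide a fixed-length window over the sequence; the next location is the label."""
    pairs = []
    for start in range(len(locations) - trajectory_length):
        window = locations[start:start + trajectory_length]
        label = locations[start + trajectory_length]
        pairs.append((window, label))
    return pairs


def shuffle_and_split(pairs, train_fraction=0.9, seed=0):
    """Shuffle the pairs and split them into training and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


visited = ["Home", "Work", "Restaurant", "Work", "Home"]
pairs = extract_pairs(visited, trajectory_length=3)

# The toy pairs are repeated only so that there is enough data to split.
train, test = shuffle_and_split(pairs * 10)
```

With a trajectory length of 3, the five visited locations yield two pairs: (Home, Work, Restaurant) with label Work, and (Work, Restaurant, Work) with label Home.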
We further used the FFNN, the RNN and the LSTM from [7] as our baselines. In
addition, since there is a timestamp present for every location visit in the Reality
Mining data, we also tested the performance of our model when time is included
as an extra feature. For this purpose, we aggregated the available timestamps into
hourly time slots. Finally, we evaluated a version of our model without the embedding
layer. All models were evaluated in terms of Accuracy, Accuracy@k,
Precision, Recall, and F-Score.
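
Accuracy@k is not defined further in the text; assuming it has its usual top-k meaning (a prediction counts as correct if the true location is among the k highest-scoring candidates), it can be sketched as below. The function name and the scores are made up for illustration.

```python
def accuracy_at_k(score_rows, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Made-up model scores for three samples and three candidate locations.
score_rows = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
true_labels = [1, 1, 0]

top1 = accuracy_at_k(score_rows, true_labels, k=1)
top2 = accuracy_at_k(score_rows, true_labels, k=2)
top3 = accuracy_at_k(score_rows, true_labels, k=3)
```

Under this definition, Accuracy@1 coincides with plain accuracy and the value can only grow as k increases.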

We tested several trajectory lengths (2, 5, 10 and 20) on different configurations
of the following hyperparameters:


Filter Size: Width of the filter kernel, i.e. how many consecutive locations of a trajectory it spans.
Number of Filters: The number of different filters the CNN learns.
Embedding Dimension: The dimension of the learned location features.
Dropout Probability: The fraction of neurons in the fully connected layer
that are dropped (used to reduce overfitting).


At the same time, we ran a grid search to find optimal values for the following parameters
as well: Learning Rate, Number of training Epochs and Batch Size. Both
the results and the corresponding optimal parameter set can be found in Fig. 5.
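
A grid search of this kind simply evaluates every combination of candidate values. The sketch below shows the mechanics only: the candidate values are assumptions, not the grid used in this work, and `toy_evaluate` is a made-up stand-in for training and validating a model.

```python
from itertools import product


def grid_search(grid, evaluate):
    """Try every combination in `grid` and return (best_score, best_config)."""
    names = sorted(grid)
    best_score, best_config = None, None
    for values in product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        score = evaluate(config)
        if best_score is None or score > best_score:
            best_score, best_config = score, config
    return best_score, best_config


# Candidate values are illustrative assumptions, not the grid used in the paper.
grid = {
    "learning_rate": [0.01, 0.001],
    "num_epochs": [5, 10],
    "batch_size": [50, 100],
}


def toy_evaluate(config):
    """Stand-in for training and validating a model; returns a made-up score."""
    return (-abs(config["learning_rate"] - 0.001)
            - abs(config["num_epochs"] - 10) / 100
            - abs(config["batch_size"] - 100) / 1000)


best_score, best_config = grid_search(grid, toy_evaluate)
```

The number of evaluations is the product of the list lengths (here 2 x 2 x 2 = 8), which is why grids are usually kept small when each evaluation means training a network.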

In general, it seems that the longer the trajectory, the better our model
performs with regard to almost all of our metrics, e.g., accuracy, precision, recall
and F-Score. However, if the trajectories get too long, e.g., >10, the performance drops,
especially in terms of recall and F-Score. This could be attributed to the fact
that human movement is characterized, up to a certain length, by long-term
behaviour, and thus raising the considered trajectory length in the model leads
to an improved predictive performance only up to that point.

In Fig. 6 we can see the results of our model, with and without an embedding
layer. Both CNNs, with and without embedding layer, outperform the FFNN
of [7] (used here as reference) with regard to all of our metrics. Additionally,
the embedding layer seems to give a slight performance boost. Figure 7
contains the average outcome (over all users) of our model in the single-user
case in contrast to the FFNN, RNN and LSTM architectures. Our CNN
outperforms the other ANNs in terms of accuracy by 7-8%, but falls a bit short in
terms of precision, recall and F-Score. This could be interpreted as an indication
that the CNN is worse than the other ANNs at predicting location transitions that
show up sparsely in a dataset (in our case the respective single-user datasets).
On the higher semantic level, the accuracy discrepancy between
the various models is similar to the low semantic version. However, in terms of
precision, recall and F-Score the CNN seems to perform much worse than on
the lower semantic version. It seems to disregard locations that occur relatively
seldom in the dataset almost completely, which leads to this result. In both
versions of the dataset (low and high semantic level) the embedding layer seemed
to make a small, but still significant difference.


Trajectory Length | Accuracy | Accuracy@4 | Accuracy@10 | Precision | Recall | F-Score
2                 | 0.783    | 0.976      | 0.994       | 0.455     | 0.433  | 0.443
5                 | 0.790    | 0.973      | 0.995       | 0.466     | 0.439  | 0.451
10                | 0.792    | 0.971      | 0.994       | 0.467     | 0.435  | 0.45
20                | 0.788    | 0.968      | 0.993       | 0.454     | 0.425  | 0.438


Fig. 5. Impact of trajectory length. Filter Size: 2, Embedding Dimension: 100,
Number of Filters: 50, Dropout Probability: 0.4, Batch Size: 100, Learning Rate: 0.001,
Number of Epochs: 10.


[Figure: bar chart of Accuracy, Precision, Recall and F-Score for FFNN, CNN no Embedding, and CNN]


Fig. 6. Comparison of evaluation results of our architecture with and without embedding
layer vs. FFNN.


Fig. 7. Comparison of evaluation results of our architecture (CNN) vs. Karatzoglou et
al. [7] (*) on the low semantic level (single user).

Figure 9 contains the comparison results between the single-user and the
multi-user modeling method. While the multi-user model achieves much
lower accuracies (as expected), it outperforms the single-user models by far in
terms of precision, recall and F-Score. This can be attributed to the fact that
the additional user information in the multi-user model fills the gap of missing
locations and trajectories that can often be found in the single-user models
(Fig. 8).


Fig. 8. Comparison of evaluation results of our architecture (CNN) vs. Karatzoglou
et al. [7] (*) on the high semantic level (single user).


Dataset type | Accuracy | Accuracy@2 | Accuracy@5 | Precision | Recall | F-Score
Multi User   | 0.688    | 0.885      | 0.969      | 0.53      | 0.428  | 0.474
Single User  | 0.78     | 0.919      | 0.993      | 0.149     | 0.151  | 0.149


Fig. 9. Comparison of our multi- and single-user CNN models.


Fig. 10. Impact of time in the case of the low-level semantic representation.




Finally, Fig. 10 shows how adding time as an additional training
feature affects the behaviour of our models. Similar to the results of [7], time
seems to have a negative influence on the prediction performance of our
CNN model, both in terms of accuracy and F-Score.


6 Conclusion 


In this paper, we investigated the performance of CNNs and embeddings in terms
of modeling semantic trajectories and predicting future locations in a location
prediction scenario. We evaluated our approach on a real-world dataset, using an
FFNN, an RNN and an LSTM network as baselines. We showed that our CNN-based
model outperforms all the above reference systems in terms of accuracy and is
thus capable of modeling semantic trajectories and predicting future human
movement patterns. However, our approach seems to be sensitive to sparse data.
In addition, we showed that, similar to the outcomes of [7], both the semantic
representation level and the overall number of users considered for training the
model can have a significant impact on the performance, especially with regard
to precision and recall. In our future work, we plan to further explore the use of
CNNs in the location prediction scenario by feeding additional semantic information
into the model, such as the users’ activity and their current companion.


Abstract. This paper proposes a novel approach for multi-lingual multi-label
document classification based on neural networks. We use popular
convolutional neural networks for this task with three different configurations.
The first one uses static word2vec embeddings that are left as
is, while the second one initializes the embeddings with word2vec and fine-tunes
them while learning on the available data. The last method initializes
the embeddings randomly and then optimizes them for the classification
task. The proposed method is evaluated on four languages, namely
English, German, Spanish and Italian, from the Reuters corpus. Experimental
results show that the proposed approach is efficient and the best
obtained F-measure reaches 84%.


Keywords: Convolutional neural network · CNN ·
Document classification · Multi-label · Multi-lingual


1 Introduction 


Nowadays, the importance of multi-lingual text processing is increasing significantly
due to the extremely rapid growth of data available in several languages, particularly
on the Internet. Without multi-lingual systems it is not possible to acquire
information across languages. Multi-label classification is also often beneficial
because, in the case of real data, one sample usually belongs to more than one
class.

This paper focuses on multi-lingual multi-label document classification within
the framework of a real application designed for handling texts from different sources in
various languages. There are several ways to perform classification
in multiple languages. Most of them learn one model in a mono-lingual space
and then use some transformation method to pass across the languages. The
usual document representation is word embeddings created, for instance, by the
word2vec approach [8]. Contrary to this idea, we suggest one general model
trained on all available languages. Therefore, this model is able to classify documents
in multiple languages without any transformation.

We use popular convolutional networks for this task with three different
settings. The first one uses static word2vec embeddings that are not trained.
The second one initializes the embeddings with word2vec and fine-tunes them on
the available data. The last method initializes the embeddings randomly and then,
as in the previous case, optimizes them for the given task using the available
data. All these methods use the same vocabulary.
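
The three embedding settings differ only in how the vectors are initialised and whether they are updated during training. The sketch below captures that distinction in plain Python; the helper name, the toy vocabulary, the made-up "word2vec" vectors and the random fallback for words missing from the pretrained set are all assumptions for illustration, not details taken from this paper.

```python
import random


def init_embeddings(vocabulary, dim, pretrained=None, trainable=True, seed=0):
    """Build an embedding table for one of the three settings.

    pretrained: optional dict word -> vector (stands in for word2vec vectors).
    trainable:  whether the vectors would be updated during training.
    """
    rng = random.Random(seed)
    vectors = {}
    for word in vocabulary:
        if pretrained is not None and word in pretrained:
            vectors[word] = list(pretrained[word])
        else:
            # Assumed fallback: small random vector.
            vectors[word] = [rng.uniform(-0.25, 0.25) for _ in range(dim)]
    return {"vectors": vectors, "trainable": trainable}


vocabulary = ["bank", "Bank", "banco"]  # toy shared multi-lingual vocabulary
toy_word2vec = {"bank": [0.1, 0.2], "Bank": [0.1, 0.3], "banco": [0.0, 0.2]}  # made up

static = init_embeddings(vocabulary, 2, pretrained=toy_word2vec, trainable=False)
fine_tuned = init_embeddings(vocabulary, 2, pretrained=toy_word2vec, trainable=True)
random_init = init_embeddings(vocabulary, 2, pretrained=None, trainable=True)
```

All three tables are built over the same vocabulary, so the networks that consume them can share an identical architecture.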

To the best of our knowledge, there is no previous study that uses one classifier
on multi-lingual multi-label data as proposed in this paper. The proposed
approach is evaluated on four languages (English, German, Spanish and Italian)
from the standard Reuters corpus.


2 Related Work 


This section first presents the usage of neural networks for document classification
and then focuses on multi-linguality.

Feed-forward neural networks were used for multi-label document classification
in [16]. The authors modified the standard backpropagation algorithm
for multi-label learning by employing a novel error function. This approach is
evaluated on functional genomics and text categorization.

Le and Mikolov [8] propose the so-called Paragraph Vector, an unsupervised
algorithm that addresses the need for a fixed-length document representation.
This algorithm represents each document using a dense vector. This
vector is trained to predict words in the document. The authors obtain new
state-of-the-art results on several text classification and sentiment analysis tasks.

A recent study on multi-label text classification was presented by Nam
et al. [12]. The authors use the cross-entropy loss instead of a ranking loss
for training, and they also employ recent advances in the deep learning field,
e.g. rectified linear unit activations and AdaGrad learning with dropout
[11,14]. A tf-idf representation of documents is used as the network input. The
multi-label classification is done by thresholding the output layer. The approach
is evaluated on several multi-label datasets and reaches results comparable to or
better than the state of the art.
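
Multi-label prediction by thresholding the output layer can be sketched as follows. The sigmoid outputs, the fixed threshold of 0.5, the label names and the scores are assumptions made for illustration; the exact thresholding scheme of the cited work is not described here.

```python
import math


def sigmoid(x):
    """Squash a raw output score into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, label_names, threshold=0.5):
    """Return every label whose sigmoid output reaches the threshold."""
    return [name for name, z in zip(label_names, logits) if sigmoid(z) >= threshold]


label_names = ["economy", "politics", "sports"]  # hypothetical label set
logits = [2.0, 0.3, -1.5]                        # made-up network outputs

assigned = predict_labels(logits, label_names)
```

Unlike a Softmax followed by an arg-max, each output is judged independently, so a document can receive several labels or none.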

Another method [7] based on neural networks leverages the co-occurrence of
labels in the multi-label classification. Some neurons in the output layer capture
the patterns of label co-occurrences, which improves the classification accuracy.
The architecture is basically a convolutional network and utilizes word
embeddings as inputs. The method is evaluated on the natural language query
classification in a document retrieval system.

An alternative multi-label classification approach is proposed by Yang and 
Gopal [15]. The conventional representations of texts and categories are trans- 
formed into meta-level features. These features are then utilized in a learning- 
to-rank algorithm. Experiments on six benchmark datasets show the abilities of 
this approach in comparison with other methods. 

Recent work in the multi-lingual text representations field is usually based on 
word-level alignments. Klementiev et al. [5] train simultaneously two language 
models based on neural networks. The proposed method uses a regularization
which ensures that pairs of frequently aligned words have similar word embed- 
dings. Therefore, this approach needs parallel corpora to obtain the word-level 
alignment. Zou et al. [17] propose an alternative approach based on neural net- 
work language models using different regularization. 

Kočiský et al. [6] propose a bilingual word representations approach based
on a probabilistic model. This method simultaneously learns alignments and dis- 
tributed representations from bilingual data. This method marginalizes out the 
alignments, thus captures a larger bilingual semantic context. Sarath Chandar 
et al. [1] investigate an efficient approach based on autoencoders that uses word 
representations coherent between two languages. This method is able to obtain 
high-quality text representations by learning to reconstruct the bag-of-words of 
aligned sentences without any word alignments. 

Coulmance et al. [2] introduce an efficient method for bilingual word
representations called Trans-gram. This approach extends the popular skip-gram
model to the multi-lingual scenario. The model jointly learns and aligns word
embeddings for several languages, using only monolingual data and a small set of
sentence-aligned documents.


3 Multi-lingual Document Classification 


3.1 Multi-lingual Document Representation 


The documents are represented as sequences of word indexes in a shared
vocabulary V, which is constructed in the following way. Let N be the number of
available languages and let $V_n$ denote the vocabulary of the most frequent words
in the n-th language. The shared vocabulary V is then the union

$$V = \bigcup_{n=1}^{N} V_n \qquad (1)$$

The convolutional network we use for classification requires that the inputs 
have the same dimensions. Therefore, the documents with fewer words than 
a specified limit are padded, while the longer ones must be shortened. This is 
different from Kim’s approach [3] where documents are padded to the length 
of the longest document in the training set. We are working with much longer 
documents where the lengths vary significantly. Therefore, the shortening of some 
documents and thus losing some information is inevitable in our case. However, 
based on our preliminary experiments, the influence of document shortening on
the document classification score is insignificant.
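
The construction of the shared vocabulary in Eq. (1) and the padding or shortening of documents can be illustrated with a minimal Python sketch. The special tokens `<pad>` and `<unk>`, the function names and the toy corpora are assumptions made for this illustration only; they are not taken from the described system.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens, not named in the paper


def most_frequent_words(documents, size):
    """Return the `size` most frequent words of one language (the set V_n)."""
    counts = Counter(word for doc in documents for word in doc)
    return {word for word, _ in counts.most_common(size)}


def build_shared_vocabulary(corpora, size_per_language):
    """Union the per-language vocabularies into one index map, as in Eq. (1)."""
    shared = set()
    for documents in corpora.values():
        shared |= most_frequent_words(documents, size_per_language)
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary words.
    index = {PAD: 0, UNK: 1}
    for word in sorted(shared):
        index[word] = len(index)
    return index


def encode(document, index, length):
    """Map words to indexes, then shorten or pad to exactly `length` items."""
    ids = [index.get(word, index[UNK]) for word in document[:length]]
    return ids + [index[PAD]] * (length - len(ids))


# Toy corpora (illustrative only); "bank" occurs in both languages.
corpora = {
    "en": [["the", "market", "fell"], ["the", "bank", "rose"]],
    "de": [["der", "markt", "fiel"], ["die", "bank", "stieg"]],
}
vocab = build_shared_vocabulary(corpora, size_per_language=10)
print(encode(["the", "bank", "collapsed"], vocab, length=5))
```

A word that occurs in several languages, such as "bank" above, receives a single index and therefore a single embedding vector, which is the consequence of taking a set union.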


3.2 Neural Network Architecture 


The neural network learns a function $f : d \rightarrow C_d$ which maps a document $d \in D$ to
a set of categories $C_d \subseteq C$. D is the set of classified documents and C is the
set of all possible categories.
We use a CNN architecture that was proposed in [9]. This architecture utilizes
one-dimensional convolutional kernels, which is the main difference from the
network proposed by Kim in [3] where 2D kernels over the entire width of the 
word embeddings are used. The input of our network is a vector of word indexes 
of the length M where M is the number of words used for document represen- 
tation. The second layer is an embedding layer which represents a look-up table 
for the word vectors. It translates the word indexes into word vectors of length 
E. The document is then represented as a matrix with M rows and E columns. 
The next layer is the convolutional one. We use $N_C$ convolution kernels of the
size K × 1, which means we do a 1D convolution over one position in the embedding
vector over K input words. The following layer performs max-pooling
over the length M − K + 1, resulting in $N_C$ vectors of size 1 × E. The output of this
layer is then flattened and connected to a fully-connected layer with E nodes.
The output layer contains |C| nodes where |C| is the cardinality of the set of 
classified categories. 
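
To make the layer dimensions concrete, the following sketch lists the output shape of each layer for one input document. It only does shape arithmetic and is not an implementation of the network. The function name is hypothetical, the example values (M = 100, E = 300, K = 16, $N_C$ = 40, 256 fully-connected neurons, 103 categories) are the ones reported in Sects. 4.1 and 4.2, and the size of the fully-connected layer is passed as a parameter.

```python
def cnn_layer_shapes(M, E, K, N_C, dense_units, num_categories):
    """Return (layer name, output shape) pairs for one input document."""
    conv_len = M - K + 1  # 'valid' convolution over K consecutive words
    return [
        ("input word indexes", (M,)),
        ("embedding look-up", (M, E)),
        ("convolution with N_C kernels of size K x 1", (conv_len, E, N_C)),
        ("max-pooling over the word axis", (1, E, N_C)),
        ("flatten", (E * N_C,)),
        ("fully-connected", (dense_units,)),
        ("output", (num_categories,)),
    ]


shapes = cnn_layer_shapes(M=100, E=300, K=16, N_C=40,
                          dense_units=256, num_categories=103)
for name, shape in shapes:
    print(f"{name:45s} {shape}")
```

With these values the convolution produces 85 positions per embedding dimension and the flattened vector has 300 × 40 = 12,000 elements.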

The output of the network is then thresholded to get the final results. The 
values greater than a given threshold indicate the labels that are assigned to the 
classified document. The architecture of the network is depicted in Fig. 1. This 
figure shows the processing of two documents in different languages (English and 
German) by our network. Each document is handled in one training step. The 
key concept is the shared vocabulary and the corresponding shared embedding 
layer. 
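
The thresholding step can be sketched as follows. The threshold value of 0.5 and the function name are assumptions for illustration; the text above does not state the threshold that is actually used.

```python
def assign_labels(scores, categories, threshold=0.5):
    """Return the categories whose output value is greater than the threshold."""
    return {cat for cat, score in zip(categories, scores) if score > threshold}


# Hypothetical output values for four of the Reuters main categories.
categories = ["CCAT", "ECAT", "GCAT", "MCAT"]
outputs = [0.91, 0.08, 0.64, 0.50]
print(sorted(assign_labels(outputs, categories)))
```

Because the comparison is strict, an output exactly equal to the threshold does not assign the label.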


4 Experiments 


4.1 Reuters RCV1/RCV2 Dataset 


The Reuters RCV1 dataset [10] contains a large number of English documents. 
The RCV2 is a multi-lingual corpus that contains news stories in 13 languages. 
The distribution of the document lengths is shown in Fig. 2. We use four
languages, namely English, German, Spanish and Italian. We prepare two settings:
a single-label and a multi-label one.


Single-Label Configuration. The single-label setting was prepared so that we
can compare the proposed approach with the state of the art. Similarly to
other studies, we follow the set-up proposed by Klementiev et al. [5]. Four main
categories are used in this setting: Corporate/industrial (CCAT), Economics
(ECAT), Government/social (GCAT) and Markets (MCAT).

Documents with zero or more than one main category are filtered
out. In total we randomly sample 15,000 documents for each language: 10,000
documents are used for training while the remaining 5,000 are reserved for testing.
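
The filtering and sampling of the single-label set can be sketched as below. The function names, the fixed random seed and the tiny example collection are hypothetical; only the four main category codes and the rule of keeping documents with exactly one main category come from the text.

```python
import random

MAIN_CATEGORIES = ("CCAT", "ECAT", "GCAT", "MCAT")


def single_main_category(topic_codes):
    """Return the main category if the document has exactly one, else None."""
    found = [cat for cat in MAIN_CATEGORIES if cat in topic_codes]
    return found[0] if len(found) == 1 else None


def sample_and_split(documents, n_train, n_test, seed=0):
    """Filter, shuffle and split (text, topic codes) pairs for one language."""
    kept = []
    for text, codes in documents:
        label = single_main_category(codes)
        if label is not None:
            kept.append((text, label))
    random.Random(seed).shuffle(kept)  # the seed is an arbitrary choice
    sample = kept[: n_train + n_test]
    return sample[:n_train], sample[n_train:]


# Illustrative documents; the sub-topic codes are examples only.
documents = [
    ("doc a", {"CCAT", "C15"}),   # kept: one main category
    ("doc b", {"CCAT", "MCAT"}),  # dropped: two main categories
    ("doc c", {"C15"}),           # dropped: no main category
    ("doc d", {"GCAT"}),          # kept
    ("doc e", {"MCAT", "M14"}),   # kept
]
train, test = sample_and_split(documents, n_train=2, n_test=1)
print(len(train), len(test))
```

In the set-up described above, `n_train` and `n_test` would be 10,000 and 5,000 per language.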


Multi-label Configuration. In this setting we use all 103 topic codes available
in the English documents. The number of documents for each language corresponds
to the minimal number across the utilized languages, which is the Spanish one
in our case. Therefore we have 18,655 documents for each language, where three
fifths are used for training and the remaining two fifths are split between the
development and test sets.

Fig. 1. The architecture of the CNN network used for multi-lingual classification. Two
example documents are used as network input. Each document is handled in one training
step.

Fig. 2. Distribution of the document lengths in word tokens.


4.2 Neural Network Set-Up 


In all experiments with multi-label classification we use the same configuration 
of the CNN. We use 20,000 most frequent words from each language to create 
the vocabulary. The document length is adjusted to M = 100 words with regard 
to the distribution of the document lengths according to Fig. 2. The embedding 
length E is set to 300 which allows a direct usage of pre-trained word2vec vectors. 
The number of convolutional kernels $N_C$ is 40 and their shape is set to 16 × 1. We
use the valid mode for the convolutions. The number of neurons in the fully-connected
layer is 256. Before the output layer and before the fully-connected
one we add dropout layers with the probabilities set to 0.2 in both cases. The ReLU
activation function is used in all layers except the output one. The output layer
employs sigmoid function in the multi-label classification scenario. The model is 
optimized using Adaptive moment estimation (Adam) [4] algorithm and cross- 
entropy loss function. The data is shuffled in all experiments. We set the number 
of epochs to 10 in all experiments. 

The single-label model is nearly the same as the multi-label one. The only 
difference is that softmax activation function is used in the output layer. 
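
The difference between the two output layers can be illustrated with a short sketch: independent sigmoid outputs with a binary cross-entropy for the multi-label case, and a softmax with a categorical cross-entropy for the single-label case. This is a plain Python illustration under our own naming; the exact form of the cross-entropy used in the multi-label setting is an assumption, since the text only names the loss.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softmax(logits):
    """Softmax with the maximum subtracted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def multi_label_loss(logits, targets, eps=1e-12):
    """Mean binary cross-entropy over independent sigmoid outputs."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        loss -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return loss / len(logits)


def single_label_loss(logits, target_index, eps=1e-12):
    """Categorical cross-entropy of the softmax output for one document."""
    return -math.log(max(softmax(logits)[target_index], eps))


print(multi_label_loss([2.0, -1.0, 0.5], [1, 0, 1]))
print(single_label_loss([2.0, -1.0, 0.5, 0.0], target_index=0))
```

With sigmoid outputs each category is decided independently, so any number of labels can exceed the threshold, whereas the softmax outputs sum to one and select a single category.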


4.3 Single-Label Results 


Table 1 summarizes the results of the single-label classification experiments. We
use the standard Precision (Prec), Recall (Rec), F-measure (F1) and Accuracy
(ACC) metrics [13]. The confidence interval is ±0.3% at the confidence level
of 0.95.

We present all three possible settings of the embedding layer. The first 
one uses static word2vec embeddings (Word emb notrain), the second one uses 
word2vec embeddings which are fine-tuned during the network training (Word 
emb train) and the last one uses randomly initialized vectors that are trained 
(Random init). 

The results show that the training of the embeddings is beneficial and allows 
achieving significantly higher recognition scores. However, the usage of static pre- 
trained embeddings also reaches reasonable accuracy while dramatically lowering 
the time needed for the network training. 

Table 2 compares the accuracies of the proposed methods with the state of
the art. Like the other studies, we use the standard accuracy metric in this
experiment.

This table shows that our methods significantly outperform all the
other approaches. This is particularly evident in the case of the English language,
where the accuracy increases by almost 20%. We must note that the set-ups
of the other approaches differ slightly. However, the reported methods are the
most similar set-ups we found. Moreover, to the best of our knowledge, there are
no studies with exactly the same configuration as we use.


Table 1. Results of the single-label classification experiments [in %].

    | Word emb notrain          | Word emb train            | Random init
    | Prec | Rec  | F1   | ACC  | Prec | Rec  | F1   | ACC  | Prec | Rec  | F1   | ACC
en  | 93.0 | 89.7 | 91.3 | 90.2 | 96.1 | 93.9 | 95.0 | 94.4 | 96.6 | 96.3 | 96.4 | 96.3
de  | 95.3 | 94.8 | 95.1 | 95.0 | 97.0 | 96.9 | 96.9 | 96.8 | 96.6 | 96.3 | 96.4 | 96.3
es  | 98.7 | 98.1 | 98.4 | 98.3 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9
it  | 88.8 | 86.7 | 87.8 | 86.9 | 91.9 | 91.6 | 91.7 | 90.7 | 91.5 | 91.2 | 91.3 | 90.6
avg | 94.0 | 92.3 | 93.2 | 92.6 | 96.2 | 95.6 | 95.9 | 95.5 | 96.2 | 95.9 | 96.0 | 95.8


Table 2. Comparison with the state of the art [accuracy in %].

Method                    | de   | en
Klementiev et al. [5]     | 77.6 | 71.1
Kočiský et al. [6]        | 83.1 | 76.0
Sarath Chandar et al. [1] | 91.8 | 74.2
Coulmance et al. [2]      | 91.1 | 78.7
Word emb notrain          | 95.0 | 90.2
Word emb train            | 96.8 | 94.4
Random init               | 96.3 | 96.3


4.4 Multi-label Results 


Table 3 shows the results of our network in the multi-label scenario. We use
the standard Precision (Prec), Recall (Rec) and F-measure (F1) metrics in this
experiment. The confidence interval is ±0.35% at the confidence level of 0.95.


We can summarize the results in this table in a similar way as the previous 
one for the single-label classification. The training of the embeddings improves 
the obtained classification results. However, the training of randomly initial- 
ized vectors has worse results than the fine-tuned word2vec vectors. The best 
obtained F-measure 86.8% is, as in the previous case, for Spanish using word2vec 
initialized embeddings with a further training. 
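
For reference, precision, recall and F-measure over multi-label predictions can be computed as in the sketch below. The micro-averaging over all document-label pairs is an assumption, since the averaging scheme is not stated above, and the gold and predicted label sets are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over lists of label sets."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_measure(precision, recall)


# Hypothetical gold and predicted topic codes for three documents.
gold = [{"CCAT", "C15"}, {"GCAT"}, {"MCAT", "M14"}]
predicted = [{"CCAT"}, {"GCAT", "ECAT"}, {"MCAT", "M14"}]
print(micro_prf(gold, predicted))
```

The same harmonic mean links the columns of Table 3; for example, the German "Word emb train" precision of 87.5 and recall of 81.2 give an F-measure of about 84.2.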


4.5 Word Similarity Experiment 


The last experiment analyzes the quality of the resulting embeddings obtained
by the three neural network settings.

Table 3. Precision (Prec), Recall (Rec), F-measure (F1) of the multi-label classification
[in %].

    | Word emb notrain   | Word emb train     | Random init
    | Prec | Rec  | F1   | Prec | Rec  | F1   | Prec | Rec  | F1
en  | 84.3 | 62.7 | 71.9 | 85.4 | 89.2 | 82.2 | 83.6 | 75.1 | 79.2
de  | 84.2 | 69.8 | 76.3 | 87.5 | 81.2 | 84.2 | 86.5 | 77.3 | 81.6
es  | 90.4 | 77.1 | 83.2 | 89.4 | 84.3 | 86.8 | 89.4 | 81.5 | 85.3
it  | 84.9 | 68.4 | 75.8 | 86.5 | 81.2 | 83.8 | 85.2 | 77.8 | 81.3
avg | 86.0 | 69.5 | 76.8 | 87.2 | 81.5 | 84.3 | 86.2 | 77.9 | 81.9


Table 4. Ten closest words to the English word “accident” based on the cosine similarity;
English translation in brackets including the language of the given word.

Word emb notrain            |         | Word emb train                  |         | Random init                        |
Word                        | Cos sim | Word                            | Cos sim | Word                               | Cos sim
accidents                   | 0.860   | accidente                       | 0.685   | ruehe                              | 0.248
incident                    | 0.740   | unglück (de, misfortune)        | 0.632   | bloccando (es, blocking)           | 0.239
accidente (es, accident)    | 0.600   | estrelló (es, crashed)          | 0.609   | compelled                          | 0.236
incidents                   | 0.574   | accidents                       | 0.599   | numerick                           | 0.219
accidentes (es, accidents)  | 0.546   | geborgen (de, secure)           | 0.585   | fiduciary                          | 0.217
disaster                    | 0.471   | absturz                         | 0.584   | barriles (es, barrels)             | 0.216
explosions                  | 0.461   | unglücks (de, misfortunes)      | 0.576   | andhra                             | 0.214
incidence                   | 0.452   | abgestürzt (de, crashed)        | 0.567   | touring                            | 0.212
personnel                   | 0.452   | trümmern (de, rubble)           | 0.560   | versicherers (de, insurers)        | 0.209
unfall (de, accident)       | 0.450   | unglücksursache (de, ill cause) | 0.551   | oppositioneller (de, oppositional) | 0.203



Table 4 shows the 10 most similar words to the English word “accident” across
all languages based on the cosine similarity. These words are mainly English
when word2vec initialization without any training is used (the first column).
Further training of the embeddings (middle column) also shifts German and
Spanish words with a similar meaning closer to the word “accident”
in the embedding space. On the other hand, when training from randomly
initialized vectors, the ten most similar words often have quite a different meaning.
However, as shown in the classification results, this fact has nearly no impact
on the resulting F-measure. We can conclude that word2vec initialization is not
necessary for the classification task. This table further shows that the similarity
between Germanic (English and German) languages is clearly visible.

Table 5 shows the 10 most similar words to the English word “czech” using the
cosine similarity. The table is very similar to the previous one. For instance,
if we take a look at the Word emb train column, we observe that there is (as
in the previous case) a significant decrease of the cosine similarity. On
the other hand, some new words, which are more related to the word “czech”,
are included. The table also confirms that randomly initialized
embeddings are not suitable for finding similar words. It is worth noting that although the Czech
language is not a part of our corpus, some Czech words (praha, dnes, fronta)
are also included due to the Czech citations available.
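
The neighbour lists in Tables 4 and 5 are based on the cosine similarity between rows of the shared embedding matrix. A minimal sketch of such a query is given below; the three-dimensional toy vectors and the function names are made up for illustration and do not come from the trained models.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(query, embeddings, k=10):
    """Return the k words closest to `query`, most similar first."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word)
              for word, vec in embeddings.items() if word != query]
    scored.sort(reverse=True)
    return [(word, round(sim, 3)) for sim, word in scored[:k]]


# Toy embeddings (hypothetical values) for words from several languages.
embeddings = {
    "accident": [1.0, 0.2, 0.0],
    "accidents": [0.9, 0.25, 0.05],
    "unfall": [0.8, 0.1, 0.3],
    "touring": [0.0, 0.1, 1.0],
}
for word, sim in nearest_words("accident", embeddings, k=3):
    print(word, sim)
```

Because all languages share one vocabulary and one embedding layer, a single query ranks words of every language at once.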


Table 5. Ten closest words to the word “czech” based on the cosine similarity; English
translation in brackets including the language of the given word.

Word emb notrain   |         | Word emb train                        |         | Random init                  |
Word               | Cos sim | Word                                  | Cos sim | Word                         | Cos sim
czechoslovakia     | 0.757   | czechoslovakia                        | 0.399   | festakt (de, ceremony)       | 0.273
slovakia           | 0.634   | praga (es, prague)                    | 0.335   | val                          | 0.250
polish             | 0.569   | republic                              | 0.329   | provence                     | 0.235
hungary            | 0.539   | brno (cz, brno - czech city)          | 0.315   | sostiene (es, hold)          | 0.222
hungarian          | 0.537   | slovak                                | 0.314   | larry                        | 0.216
prague             | 0.533   | praha (cz, prague)                    | 0.313   | köpfigen (de, headed)        | 0.212
slovak             | 0.509   | dnes (cz, today)                      | 0.307   | überschreiten (de, exceed)   | 0.206
praha (cz, praha)  | 0.509   | checa (es, czech)                     | 0.307   | aktienindex (de, share index)| 0.205
austrian           | 0.506   | fronta (cz, queue)                    | 0.304   | councils                     | 0.205
lithuanian         | 0.496   | tschechoslowakei (de, czechoslovakia) | 0.297   | bancario (it, banking)       | 0.205


5 Conclusions 


In this paper we presented a novel approach for the multi-label document clas- 
sification in multiple languages. The proposed method builds on the popular 
convolutional networks. We added a simple yet efficient extension that allows 
using one network for classifying text documents in more languages. 

We evaluated our method on four languages from the Reuters corpus in both
multi- and single-label classification scenarios. We showed that the proposed
approach is efficient and that the best obtained F-measure in the multi-label scenario
reaches 84%. We also showed that our methods significantly outperform all the
other approaches in the single-label setting. Another added value of this approach
is that no language identification is needed, as it would be when using
separate single-language networks.


Abstract. Facial expressions play an important role in conveying the
emotional states of human beings. Recently, deep learning approaches 
have been applied to image recognition field due to the discriminative 
power of Convolutional Neural Network (CNN). In this paper, we first 
propose a novel Multi-Region Ensemble CNN (MRE-CNN) framework 
for facial expression recognition, which aims to enhance the learning 
power of CNN models by capturing both the global and the local fea- 
tures from multiple human face sub-regions. Second, the weighted pre- 
diction scores from each sub-network are aggregated to produce the final 
prediction of high accuracy. Third, we investigate the effects of differ- 
ent sub-regions of the whole face on facial expression recognition. Our 
proposed method is evaluated based on two well-known publicly avail- 
able facial expression databases: AFEW 7.0 and RAF-DB, and has been 
shown to achieve the state-of-the-art recognition accuracy. 


Keywords: Expression recognition · Deep learning ·
Convolutional Neural Network · Multi-region ensemble


1 Introduction 


Facial expression recognition (FER) has many practical applications such as 
treatment of depression, customer satisfaction measurement, fatigue surveillance 
and Human Robot Interaction (HRI) systems. Ekman et al. [2] defined a set of 
prototypical facial expressions (e.g. anger, disgust, fear, happiness, sadness, and 
surprise). Since Convolutional Neural Network (CNN) has already proved its 
excellence in many image recognition tasks, we expect that it can show bet- 
ter results than already existing machine learning methods in facial expression 
prediction problems. A well-designed CNN trained on millions of images can 
parameterize a hierarchy of filters, which capture both low-level generic features 
and high-level semantic features. Moreover, current Graphics Processing Units 
(GPUs) expedite the training process of deep neural networks to tackle big-data 
problems. However, unlike large scale visual object recognition databases such
as ImageNet [10], existing facial expression recognition databases do not have
sufficient training data, resulting in overfitting problems.

© Springer Nature Switzerland AG 2018

Fig. 1. An overview of our approach: Multi-Region Ensemble CNN (MRE-CNN) framework.
(Panels: original image, face detection, face alignment.)

CNN approaches took the top three places in the 2014 ImageNet challenge [10]
for the object recognition task, with the VGGNet [11] architecture achieving a
remarkably low error rate. Earlier, AlexNet [5] demonstrated
the effectiveness of CNNs by introducing convolutional layers followed by
max-pooling layers and Rectified Linear Units (ReLUs). AlexNet significantly
outperformed the runner-up with a top-5 error rate of 15.3% in the 2012
ImageNet challenge [10]. In our proposed framework, one of the network structures
is based on AlexNet and the other one, VGG-16, is a deeper network based on
VGGNet [11].

The goal of automatic FER is to classify faces in static images or dynamic 
image sequences as one of the six basic emotions. However, it is still a challeng- 
ing problem due to head pose, image resolution, deformations, and illumination 
variations. This paper is the first attempt to exploit the local characteristics 
of different parts of the face by constructing different sub-networks. Our main 
contributions are three-fold and can be summarized as follows: 


— A novel Multi-Region Ensemble CNN framework is proposed for facial expres- 
sion recognition, which takes full advantage of both global information and 
local characteristics of the whole face. 

— Based on the weighted sum operation of the prediction scores from each sub- 
network, the final recognition rate can be improved compared to the original 
single network. 


86 Y. Fan et al. 


— Our MRE-CNN framework achieves a very appealing performance and out- 
performs some state-of-the-art facial expression methods on AFEW 7.0 
Database [1] and RAF-DB [6]. 


2 Related Work 


Several studies have proposed different architectures of CNN in terms of FER 
problems. Hu et al. [4] integrated a new learning block named Supervised Scoring 
Ensemble (SSE) into their CNN model to improve the prediction accuracy. This
has inspired us to incorporate other well-designed learning strategies into existing
mainstream networks to bring about accuracy gains. The authors of [8] followed a transfer learning
approach for deep CNNs by utilizing a two-stage supervised fine-tuning of a
network pre-trained on the generic ImageNet [10] dataset. This implies
that we can mitigate the overfitting problems caused by limited expression
data via transfer learning. In [7], inception layers and the network-in-network
theory were applied to solve the FER problem, which focuses on the network 
architecture. However, most of the previous methods have processed the entire 
facial region as the input of their CNN models, paying less attention to the 
sub-regions of human faces. To our knowledge, few works have been done by 
directly cropping the sub-regions of facial images as the input of CNN in FER. 
In this paper, each sub-network in our MRE-CNN framework will process a pair 
of facial regions, including a whole-region image and a sub-region image. 


3 The Proposed Method 


The overview of our proposed MRE-CNN framework is shown in Fig. 1. We will 
start with the data preparation, and then describe the detailed construction for 
our MRE-CNN framework. 


3.1 Data Pre-processing 


Datasets. The Real-world Affective Faces Database¹ (RAF-DB) [6], which
contains about 30000 real-world facial images from thousands of individuals,
was recently released to encourage more research on real-world expressions. The images
(12271 training samples and 3068 testing samples) in RAF-DB were downloaded
from Flickr, after which humans were asked to pick out images related to the
six basic emotions, plus the neutral emotion. The other database, Acted Facial
Expressions in the Wild (AFEW 7.0) [1], was established for the 2017 Emotion
Recognition in the Wild Challenge² (EmotiW). AFEW 7.0 consists of training
(773), validation (383) and test (653) video clips, where samples are labeled with
seven expressions: angry, disgust, fear, happy, sad, surprise and neutral (Fig. 2).


1 http://www.whdeng.cn/RAF/model1.html.
2 https://sites.google.com/site/emotiwchallenge/.

Fig. 2. The first row displays cropped faces extracted from images in RAF-DB, and
the second row represents faces sampled across video clips in AFEW 7.0. 


Face Detection and Alignment. For each video clip in AFEW 7.0, after
using a face tracker [3], we sample 3-10 frames that have clear faces with an
adaptive frame interval. To extract and align faces both from the original images
in RAF-DB and from the video frames in AFEW 7.0, we use the face detector of the
C++ library Dlib³ to locate the 68 facial landmarks. As shown in Fig. 3, based on
the coordinates of the localized landmarks, the aligned and cropped whole-region and
sub-regions of the face image can be generated in a uniform template with an
affine transformation. In this stage, we align and crop regions of the left eye,
the nose and the mouth, as well as the whole face. Then the three
pairs of images are all resized to 224 × 224 pixels.


Fig. 3. The processing of the cropped whole-region and sub-regions of the facial image.
(Panels: face image, face landmarks, whole-region, sub-region.)


3.2 Multi-Region Ensemble Convolutional Neural Network 


Our framework is illustrated in Fig. 1. We take three significant sub-regions of the
human face into account: the left eye, the nose and the mouth. Each particular
sub-region is accompanied by its corresponding whole facial image, forming
a double-input sub-network in the Multi-Region Ensemble CNN (MRE-CNN) framework.
Afterwards, based on the weighted sum operation of the three prediction scores
from the sub-networks, we get the final prediction.


3 dlib.net.


Particularly, to encourage intra-class compactness and inter-class separability,
each subnet adopts the softmax loss function which is given by

$$Loss(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_i = j\} \log \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}} \right] \qquad (1)$$

where $x^{(i)}$ denotes the features of the i-th sample, taken from the final hidden
layer before the softmax layer, m is the number of training data, and k is the
number of classes. We define the i-th input feature $x^{(i)} \in R^d$ with the predicted
label $y_i$. $\theta$ is the parameter matrix of the softmax function $Loss(\theta)$. Here $1\{\cdot\}$
means 1{a true statement} = 1 and 1{a false statement} = 0.
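
The softmax loss described above can be written out directly. The following plain Python sketch is meant only to make the notation concrete: it indexes classes from 0 to k − 1 instead of 1 to k, the function name and the toy values are assumptions, and it is not the Caffe implementation used in the experiments.

```python
import math


def softmax_loss(theta, features, labels):
    """Mean negative log-likelihood of the softmax classifier.

    theta:    k rows of d weights (one row per class)
    features: m rows of d values (x^(i) for each sample)
    labels:   m class indexes in the range 0..k-1
    """
    m = len(features)
    total = 0.0
    for x, y in zip(features, labels):
        logits = [sum(t * v for t, v in zip(row, x)) for row in theta]
        peak = max(logits)  # subtracted for numerical stability
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        # The indicator 1{y_i = j} selects only the term of the true class.
        total += logits[y] - log_norm
    return -total / m


# Toy example with k = 3 classes and d = 2 features (hypothetical values).
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
features = [[2.0, 0.5], [0.1, 1.5]]
labels = [0, 1]
print(softmax_loss(theta, features, labels))
```

With an all-zero parameter matrix every class receives probability 1/k, so the loss equals log k, which is a convenient sanity check.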


Data Augmentation. Despite the training size of RAF-DB, it is still insuf- 
ficient for training a designed deep network. Therefore we utilize both offline 
data augmentation and on-the-fly data augmentation techniques. The number 
of training samples increases fifteen-fold after introducing methods including 
image rotation, image flips and Gaussian distribution random perturbations. 
Besides, on-the-fly data augmentation is embedded in the deep learning frame- 
work, Caffe, by randomly cropping the input images and then flipping them 
horizontally. 


3.3 The Sub-networks in MRE-CNN Framework 


As Fig. 4 shows, we adopt 13 convolutional layers and 5 max pooling layers and 
concatenate the outputs from two pool5 layers before going through the first 
fully connected layer. The final softmax layer gives the prediction scores. When 
employing VGG-16 [11], we finetune the pre-trained model with the training set 
of AFEW 7.0 and RAF-DB, respectively, in the following experiments. 


[Figure: feature map sizes 224*224, 112*112, 56*56, 28*28, 14*14 and 7*7 for each of the
two inputs, followed by two 4096-unit fully connected layers, fc8 and softmax.]

Fig. 4. The VGG-16 sub-network architecture in MRE-CNN framework.

To validate the proposed MRE-CNN framework, our modified AlexNet architecture
does not use any pre-trained models during its training process. For the
AlexNet sub-network, we use 5 convolutional layers and 3 max pooling layers,
the same as in the traditional CNN architecture. Different from the original
AlexNet, the last two fully connected layers have 64 outputs and 7 outputs, 
respectively, making it possible to retrain a deep network with limited data. The 
following experiment results indicate its effectiveness in the MRE-CNN frame- 
work structure, despite its simplified network architecture. 

Finally, we combine the three predictions from the three sub-networks by
conducting the weighted sum operation. The predicted emotion $P_{MRE-CNN}$ is
defined as

$$P_{MRE-CNN} = \sum_{n=1}^{z} \alpha_n \frac{e^{\theta_j^{T} x^{(i)}}}{\sum_{l=1}^{k} e^{\theta_l^{T} x^{(i)}}} \qquad (2)$$

where $\alpha_n$ denotes the weight for a single sub-network and z is equal to 3 as we
utilize three sub-networks. Other parameters are the same as those in Eq. 1.


4 Experiments 


4.1 Experimental Setup 


All training and testing processes were performed on NVIDIA GeForce GTX 
1080Ti 11G GPUs. We developed our models in the deep learning framework 
Caffe. On the Ubuntu Linux system equipped with NVIDIA GPUs, training a
single model in MRE-CNN took 4-6 hours depending on the architecture of the 
sub-network. 


4.2 Implementation Details 


In the data augmentation stage, we augment the set of training images in RAF-DB
and frames in AFEW 7.0 by flipping, rotating each by ±4° and ±6°, and
adding Gaussian white noise with variances of 0.001, 0.01 and 0.015. We then
train our VGG-16 sub-networks for 20k iterations with the following parameters: 
learning rate 0.0001-0.0005, weight decay 0.0001, momentum 0.9, batch size 16 
and linear learning rate decay in stochastic gradient descent (SGD) optimizer. 
For AlexNet sub-networks, we train them for 30k iterations with the batch size of 
64 and the learning rate begins from 0.001. In the ensemble prediction stage, the 
specific weights of MRE-CNN (VGG-16 Sub-network) are 4/7 (lefteye weight), 
2/7 (mouth weight) and 1/7 (nose weight) and those of MRE-CNN (AlexNet 
Sub-network) are 2/5 (lefteye weight), 2/5 (mouth weight) and 1/5 (nose weight), 
respectively. 
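
The ensemble prediction stage can be sketched as a weighted sum of the softmax outputs of the three sub-networks, using the VGG-16 weights given above. The per-region scores below are hypothetical values chosen for illustration, and the function name is our own.

```python
EXPRESSIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

# Sub-network weights reported for MRE-CNN (VGG-16 sub-network).
VGG16_WEIGHTS = {"lefteye": 4 / 7, "mouth": 2 / 7, "nose": 1 / 7}


def ensemble_predict(scores, weights):
    """Weighted sum of per-region softmax scores; returns (label, combined)."""
    combined = [0.0] * len(EXPRESSIONS)
    for region, weight in weights.items():
        for j, score in enumerate(scores[region]):
            combined[j] += weight * score
    best = max(range(len(combined)), key=lambda j: combined[j])
    return EXPRESSIONS[best], combined


# Hypothetical softmax outputs of the three sub-networks for one face.
scores = {
    "lefteye": [0.05, 0.05, 0.05, 0.60, 0.05, 0.15, 0.05],
    "mouth":   [0.05, 0.05, 0.05, 0.30, 0.05, 0.45, 0.05],
    "nose":    [0.10, 0.10, 0.10, 0.20, 0.10, 0.30, 0.10],
}
label, combined = ensemble_predict(scores, VGG16_WEIGHTS)
print(label, [round(c, 3) for c in combined])
```

Since the weights sum to one and each sub-network outputs a probability distribution, the combined scores also sum to one. In this toy case the left-eye sub-network outvotes the other two because of its larger weight.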

4.3 Results on RAF-DB


RAF-DB is split into a training set and a test set with the idea of five-fold cross- 
validation and we performed the 7-class basic expression classification bench- 
mark experiment. In the RAF-DB test protocol, the ultimate metric is the mean 
diagonal value of the confusion matrix rather than the accuracy due to imbal- 
anced distribution in expressions. In this experiment, we directly train our deep 
learning models with our processed training samples from RAF-DB, without 
using other databases. In detail, after filtering out the images with no detected face
and applying data augmentation techniques, 95465 cropped face images are generated,
accompanied by left-eye, mouth and nose images.
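
The mean diagonal value of the row-normalized confusion matrix is the average of the per-class recalls, which is why it is less sensitive to class imbalance than the plain accuracy. The sketch below contrasts the two on a small three-class matrix of hypothetical counts; the numbers are not taken from RAF-DB.

```python
def mean_diagonal(counts):
    """Average per-class recall; rows are true classes, columns are predictions.

    Assumes every class has at least one sample (no empty rows).
    """
    recalls = [row[i] / sum(row) for i, row in enumerate(counts)]
    return sum(recalls) / len(recalls)


def accuracy(counts):
    """Fraction of all samples on the diagonal."""
    correct = sum(row[i] for i, row in enumerate(counts))
    total = sum(sum(row) for row in counts)
    return correct / total


# Hypothetical counts: one frequent class and two rare ones.
counts = [
    [90, 10, 0],  # 100 samples of class 0, 90 correct
    [5, 5, 0],    # 10 samples of class 1, 5 correct
    [2, 0, 8],    # 10 samples of class 2, 8 correct
]
print(round(mean_diagonal(counts), 4), round(accuracy(counts), 4))
```

Here the frequent class dominates the accuracy, while the mean diagonal gives each class the same weight and is therefore noticeably lower.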


Table 1. Confusion matrix for RAF-DB based on MRE-CNN (VGG-16 sub-network).
The term Real represents the true labels (0 = Angry, 1 = Disgust, 2 = Fear, 3 =
Happy, 4 = Sad, 5 = Surprise, 6 = Neutral) and Pred represents the predicted value.

Real | Pred 0 | Pred 1 | Pred 2 | Pred 3 | Pred 4 | Pred 5 | Pred 6
0    | 0.8395 | 0.0802 | 0.0185 | 0.0185 | 0.0123 | 0.0062 | 0.0247
1    | 0.1125 | 0.5750 | 0.0063 | 0.0813 | 0.0750 | 0.0187 | 0.1313
2    | 0.0811 | 0.0000 | 0.6081 | 0.0270 | 0.0676 | 0.1757 | 0.0405
3    | 0.0110 | 0.0211 | 0.0051 | 0.8878 | 0.0127 | 0.0110 | 0.0515
4    | 0.0209 | 0.0565 | 0.0084 | 0.0167 | 0.7992 | 0.0105 | 0.0879
5    | 0.0213 | 0.0182 | 0.0334 | 0.0030 | 0.0122 | 0.8602 | 0.0517
6    | 0.0088 | 0.0632 | 0.0000 | 0.0221 | 0.0706 | 0.0338 | 0.8015


The confusion matrix based on MRE-CNN (VGG-16 Sub-network)
in Table 1 shows that our proposed model performs well when classifying happy, surprise
and angry emotions, with accuracies of 88.78%, 86.02% and 83.95%, respectively.
For comparison, in Table 2 we show the results of the trained DCNN models followed
by different classifiers which are proposed in [6]. We find that our proposed
MRE-CNN (VGG-16) framework outperforms all of the existing state-of-the-art 
methods evaluated on RAF-DB. In addition, the MRE-CNN (AlexNet) frame- 
work also achieves a very appealing performance although we retrain the AlexNet 
sub-networks with limited data. 

Furthermore, we separated the sub-network modules from the MRE-CNN framework
and evaluated them individually on the test set of RAF-DB.
The results can be viewed in Table 3. The first row shows the average
accuracy of Face+LeftEye when applying the VGG-16 sub-network in the MRE-CNN
framework, which is higher than that of Face+Mouth. Thus we assign a higher
weight to the Face+LeftEye subnet when combining the three predictions with an
appropriate ensemble method. The Face+Nose subnet is slightly less effective,
probably because it carries less information related to emotions; nevertheless, it is still superior
to the VGG-FACE model given in Table 2 with only the whole face region as
input.
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Table 2. Performance of different methods on RAF-DB (The metric is the mean 
diagonal value of the confusion matrix). 


Method                | Angry | Disgust | Fear  | Happy | Sad   | Surprise | Neutral | Average
DLP-CNN+mSVM [6]      | 71.60 | 52.15   | 62.16 | 92.83 | 80.13 | 81.16    | 80.29   | 74.20
DLP-CNN+LDA [6]       | 77.51 | 55.41   | 52.50 | 90.21 | 73.64 | 74.07    | 73.53   | 70.98
AlexNet+mSVM [6]      | 58.64 | 21.87   | 39.19 | 86.16 | 60.88 | 62.31    | 60.15   | 55.60
AlexNet+LDA [6]       | 43.83 | 27.50   | 37.84 | 75.78 | 39.33 | 61.70    | 48.53   | 47.79
VGG+mSVM [6]          | 68.52 | 27.50   | 35.13 | 85.32 | 64.85 | 66.32    | 59.88   | 58.22
VGG+LDA [6]           | 66.05 | 25.00   | 37.84 | 73.08 | 51.46 | 53.49    | 47.21   | 50.59
Single VGG-FACE       | 82.19 | 56.62   | 55.41 | 86.38 | 79.52 | 83.93    | 71.18   | 73.60
Our MRE-CNN (AlexNet) | 77.78 | 65.62   | 58.11 | 87.75 | 75.73 | 81.16    | 77.21   | 74.78
Our MRE-CNN (VGG-16)  | 83.95 | 57.50   | 60.81 | 88.78 | 79.92 | 86.02    | 80.15   | 76.73


Table 3. Sub-region comparison (the metric is the mean diagonal value of the confusion 
matrix). 


Architecture                              | Average
Face+LeftEye (Single VGG-16 sub-network)  | 76.52
Face+Nose (Single VGG-16 sub-network)     | 75.64
Face+Mouth (Single VGG-16 sub-network)    | 76.13
Our MRE-CNN (VGG-16)                      | 76.73


Table 4. Comparisons with the state-of-the-art methods on AFEW 7.0 (the metric is 
the average accuracy of all validation videos). 


Network architecture  | Training data             | Validation (%)
C3D [9]               | 16 frames for each video  | 35.20
Resnet-LSTM [9]       | 16 frames for each video  | 46.70
VGG-LSTM [9]          | 16 frames for each video  | 47.40
Trajectory+SVM [13]   | 30 frames for each video  | 37.37
VGG-BRNN [13]         | 40 frames for each video  | 44.46
C3D-LSTM [12]         | Detected face frames      | 43.20
Our MRE-CNN (AlexNet) | Detected face frames      | 40.11
Our MRE-CNN (VGG-16)  | Detected face frames      | 47.43


4.4 Results on AFEW 7.0 


To validate the performance of our models, we also conduct experiments on the
validation set of AFEW 7.0. The task is to assign a single expression label from
seven candidate categories to each video clip in the validation set (383 video
clips). Note that all CNN models in our MRE-CNN framework are trained only on
the given training data (773 video clips), without any outside data.
Because faces temporarily disappear or are occluded in some videos, we use only
frames with detected faces for training and prediction. In our experiments, the
predicted emotion scores of each video are calculated by averaging the scores of
all its detected face frames. As Table 4 shows, on the validation set of
AFEW 7.0 our MRE-CNN (VGG-16) framework reaches 47.43%, which is higher than
all the compared methods, although the margin over VGG-LSTM [9] (47.40%) is
small.
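
The video-level prediction described above, averaging the scores of all detected face frames, can be sketched as follows. This is a minimal illustration under assumptions: the function name `predict_video` and the example scores are hypothetical, and the frame scores are assumed to be per-class probabilities produced by the ensemble.

```python
import numpy as np

def predict_video(frame_scores):
    """Average per-frame class scores and return (mean_scores, predicted_label).

    frame_scores has shape (num_detected_frames, num_classes).
    """
    frame_scores = np.asarray(frame_scores, dtype=float)
    if frame_scores.ndim != 2 or frame_scores.shape[0] == 0:
        raise ValueError("expected a non-empty (num_frames, num_classes) array")
    mean_scores = frame_scores.mean(axis=0)
    return mean_scores, int(mean_scores.argmax())

# Hypothetical scores for 3 detected face frames and 7 emotion classes.
scores = np.array([
    [0.10, 0.05, 0.05, 0.50, 0.10, 0.10, 0.10],
    [0.05, 0.05, 0.10, 0.40, 0.20, 0.10, 0.10],
    [0.10, 0.10, 0.10, 0.20, 0.30, 0.10, 0.10],
])
video_scores, video_label = predict_video(scores)
```

Frames without a detected face never enter `frame_scores`, so a clip with long occlusions is simply scored on fewer frames; a clip with no detections at all would need separate handling, which the text does not describe.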


4.5 Discussions 


A series of feature maps are shown in Fig. 5 for the VGG-16 sub-network in our
MRE-CNN framework, which reflect the differences among the filters of the
first three convolutional layers. It can be observed that the outputs of
shallower layers capture more profile information, while the outputs of deeper
layers encode semantic information. Shallower layers can learn rich low-level
features that help refine the irregular features from deeper layers.
Furthermore, by combining features from the whole region and sub-regions of
the human face, the resulting architecture provides richer feature maps, which
raises the recognition rate for FER problems.


Fig. 5. Visualization of the feature maps of the first three convolutional layers for the 
input image on the left of each row. 


Generally, our method explicitly inherits the advantage of information
gathered from multiple local regions of face images, acting as a deep feature
ensemble with two single CNN architectures, and hence it naturally improves
the final prediction accuracy. The disadvantage of our approach is that we use
grid search to determine the contribution of each individual sub-network,
which is relatively computationally expensive. In future work, we plan to use
ensemble methods such as AdaBoost to determine the best weights for the
different subnets. Although facial expression recognition based on face images
can achieve promising results, facial expression is only one modality of
realistic human behavior. Combining facial expressions with other modalities,
such as audio information, physiological data and thermal infrared images, can
provide complementary information and further enhance the robustness of our
models. Therefore, incorporating facial expression models and models of other
modalities into a high-level framework is a promising research direction.
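
The grid search over sub-network contributions mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the weights are assumed to be non-negative and to sum to one, the grid step of 0.1 is hypothetical, and the synthetic probabilities stand in for the outputs of the three sub-networks.

```python
import itertools
import numpy as np

def grid_search_weights(subnet_probs, labels, step=0.1):
    """Search ensemble weights (w1, w2, w3), w_k >= 0 and sum(w) = 1, maximizing accuracy.

    subnet_probs has shape (3, num_samples, num_classes).
    """
    subnet_probs = np.asarray(subnet_probs, dtype=float)
    labels = np.asarray(labels)
    n_steps = int(round(1.0 / step))
    best_weights, best_acc = None, -1.0
    for i, j in itertools.product(range(n_steps + 1), repeat=2):
        if i + j > n_steps:
            continue  # the third weight would be negative
        w = np.array([i, j, n_steps - i - j], dtype=float) / n_steps
        fused = np.tensordot(w, subnet_probs, axes=1)  # weighted sum, (num_samples, num_classes)
        acc = float((fused.argmax(axis=1) == labels).mean())
        if acc > best_acc:
            best_weights, best_acc = w, acc
    return best_weights, best_acc

# Synthetic stand-in for three sub-networks of different quality (hypothetical data).
rng = np.random.default_rng(0)
labels = rng.integers(0, 7, size=200)
probs = rng.dirichlet(np.ones(7), size=(3, 200))
probs += np.array([0.3, 0.1, 0.2])[:, None, None] * np.eye(7)[labels][None]
probs /= probs.sum(axis=2, keepdims=True)

best_w, best_acc = grid_search_weights(probs, labels)
```

With a step of 0.1 the grid has only 66 weight combinations for three sub-networks, but the number of combinations grows quickly with a finer step or more sub-networks, which is consistent with the cost concern raised in the text. The weights should be selected on held-out data rather than on the test set.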
5 Conclusion 


We have proposed a novel Multi-Region Ensemble CNN framework in this study, 
which takes full advantage of different regions of the whole human face. By 
assigning different weights to three sub-networks in MRE-CNN, we have com- 
bined the predictions of three separate networks. Besides, we have investigated 
the effects of three different facial regions, each providing different local infor- 
mation. As a result, our MRE-CNN framework has achieved a very appealing 
performance on RAF-DB and AFEW 7.0, as compared to other state-of-the-art 
methods. 
Abstract. Data augmentation is a popular technique largely used to 
enhance the training of convolutional neural networks. Although many of 
its benefits are well known by deep learning researchers and practitioners, 
its implicit regularization effects, as compared to popular explicit regu- 
larization techniques, such as weight decay and dropout, remain largely 
unstudied. As a matter of fact, convolutional neural networks for image 
object classification are typically trained with both data augmentation 
and explicit regularization, assuming the benefits of all techniques are 
complementary. In this paper, we systematically analyze these techniques 
through ablation studies of different network architectures trained with 
different amounts of training data. Our results unveil a largely ignored 
advantage of data augmentation: networks trained with just data augmentation 
more easily adapt to different architectures and amounts of 
training data, as opposed to weight decay and dropout, which require 
specific fine-tuning of their hyperparameters. 


Keywords: Data augmentation · Regularization · CNNs 


1 Introduction 


Data augmentation in machine learning refers to the techniques that synthet- 
ically expand a data set by applying transformations on the existing exam- 
ples, thus augmenting the amount of available training data. Although the new 
data points are not independent and identically distributed, data augmentation 
implicitly regularizes the models and improves generalization, as established by 
statistical learning theory [31]. 

Data augmentation has been long used in machine learning [27] and it has 
been identified as a critical component of many models [6,21,22]. Nonetheless, 
the literature lacks, to our knowledge, a systematic analysis of the implicit regu- 
larization effect of data augmentation on deep neural networks compared to the 
most popular regularization techniques, such as weight decay [12] and dropout 
[29], which are typically used all together. 

In a thought-provoking paper [34], Zhang et al. concluded that explicit
regularization may improve generalization performance, but is neither
necessary nor by itself sufficient for controlling generalization error. They
observed that removing weight decay and dropout does not prevent the models
from generalizing. Although they performed some ablation studies with data
augmentation, they considered it just another explicit regularization
technique. A follow-up study [16] argued that data augmentation should not be
considered an explicit regularizer and showed that explicit regularization may
not only be unnecessary, but that data augmentation alone can achieve the same
level of generalization.

Here, we build upon the ideas from [16] and, using the same methodology, 
we extend the analysis of data augmentation in contrast to weight decay and 
dropout. In particular, we focus here on the capability of data augmentation 
to adapt to deeper and shallower architectures as well as to successfully learn 
from fewer examples. We find that networks trained with data augmentation, 
but no explicit regularizers, outperform the networks trained with all techniques, 
as is common practice in the literature. We hypothesize that weight decay and 
dropout require fine-tuning of their hyperparameters in order to adapt to new 
architectures and amounts of training data, whereas the new samples generated by 
data augmentation schemes are useful regardless of the new training conditions. 


1.1 Related Work 


Data augmentation was already used in the late 80’s and early 90’s for handwrit- 
ten digit recognition [27] and it has been identified as a very important element 
of many modern successful models, like AlexNet [21], All-CNN [28] or ResNet 
[15], for instance. In some cases, heavy data augmentation has been applied with 
successful results [32]. In domains other than computer vision, data augmenta- 
tion has also been proven effective, for example in speech recognition [19], music 
source separation [30] or text categorization [24]. 

Bengio et al. [3] focused on the importance of data augmentation for recog- 
nizing handwritten digits through greedy layer-wise unsupervised pre-training 
[4]. Their main conclusion was that deeper architectures benefit more from data 
augmentation than shallow networks. Zhang et al. [34] included data augmenta- 
tion in their analysis of the role of regularization in the generalization of deep 
networks, although it was considered an explicit regularizer similar to weight 
decay and dropout. The observation that data augmentation alone outperforms 
explicitly regularized models for few-shot learning was also made by Hilliard 
et al. in [18]. Only a few works have reported the performance of their models
when trained with different levels of data augmentation, as is the case of [11]. 

Recently, the deep learning community seems to have become more aware of 
the importance of data augmentation. New techniques have been proposed [7,8] 
and, very interestingly, models that automatically learn useful data transforma- 
tions have also been published lately [2,13,23,26]. Another study [25] analyzed 
the performance of different data augmentation techniques for object recognition 
and concluded that one of the most successful approaches so far is the set of
traditional transformations carried out in most studies. Finally, a
preliminary analysis of the implicit regularization effect of data
augmentation was presented in [16], showing that data augmentation alone
provides at least the same generalization performance as weight decay and
dropout. The present work follows up on those results and extends the analysis. 


2 Experimental Setup 


This section describes the procedures we follow to explore the potential advan- 
tages of data augmentation to adapt to changes in the amount of training 
data and the network architecture, compared to the popular explicit regular- 
izers weight decay and dropout. We build upon the methodology already used 
in [16]. 


2.1 Network Architectures 


We test our hypotheses with two well-known network architectures that achieve 
successful results in image object recognition: the all convolutional network, All- 
CNN [28]; and the wide residual network, WRN [33]. 


All Convolutional Net. The original architecture of All-CNN consists of 12 
convolutional layers and has about 1.3 M parameters. In our experiments to 
compare data augmentation and explicit regularization in terms of adaptability 
to changes in the architecture, we also test a shallower version, with 9 layers and 
374K parameters, and a deeper version, with 15 layers and 2.4 M parameters. 
The three architectures can be described as follows: 


Original   2x96C3(1)-96C3(2)-2x192C3(1)-192C3(2)-192C3(1)-192C1(1)
           -N.Cl.C1(1)-Gl.Avg.-Softmax

Shallower  2x96C3(1)-96C3(2)-192C3(1)-192C1(1)
           -N.Cl.C1(1)-Gl.Avg.-Softmax

Deeper     2x96C3(1)-96C3(2)-2x192C3(1)-192C3(2)-2x192C3(1)
           -192C3(2)-192C3(1)-192C1(1)-N.Cl.C1(1)-Gl.Avg.-Softmax


where KCD(S) is a D x D convolutional layer with K channels and stride S, 
followed by batch normalization and a ReLU non-linearity. N.Cl. is the number 
of classes and Gl.Avg. refers to global average pooling. The network is identical 
to the All-CNN-C architecture in the original paper, except for the introduction 
of batch normalization. We set the same training parameters as in the original 
paper in the cases they are reported. Specifically, in all experiments the All- 
CNN networks are trained using stochastic gradient descent (SGD) with batch 
size of 128, during 350 epochs, with fixed momentum 0.9 and learning rate of 
0.01 multiplied by 0.1 at epochs 200, 250 and 300. The kernel parameters are 
initialized according to the Xavier uniform initialization [9]. 
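
The step-wise learning rate schedule described above can be written as a small function. This is a sketch under assumptions: epochs are taken to be counted from 0, and the decay is assumed to apply from the listed epoch onward; the function name `step_learning_rate` is hypothetical.

```python
def step_learning_rate(epoch, base_lr, factor, milestones):
    """Return base_lr multiplied by `factor` once for every milestone already reached."""
    n_drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** n_drops

# All-CNN settings from the text: 0.01, multiplied by 0.1 at epochs 200, 250 and 300.
ALL_CNN = dict(base_lr=0.01, factor=0.1, milestones=(200, 250, 300))

# Learning rate at a few selected epochs of the 350-epoch run.
lr_at = {e: step_learning_rate(e, **ALL_CNN) for e in (0, 199, 200, 250, 300, 349)}
```

A function of this shape could be wrapped as the `schedule` argument of a Keras `LearningRateScheduler` callback; whether the original experiments were implemented that way is not stated.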
Wide Residual Network. WRN is a residual network [15] with more units per 
layer than the original ResNet, which achieves better performance with a smaller 
number of layers. In our experiments we use the WRN-28-10 version, with 28 
layers and about 36.5 M parameters. The details of the architecture are the 
following: 


16C3(1)-4x160R-4x320R-4x640R-BN-ReLU-Avg.(8)-FC-Softmax 


where KR is a residual block with residual function BN-ReLU-KC3(1)-BN- 
ReLU-KC3(1). BN is batch normalization, Avg.(8) is spatial average pooling 
of size 8 and FC is a fully connected layer. The stride of the first convolution 
within the residual blocks is 1 except in the first block of the series of 4, where 
it is 2 to subsample the feature maps. As before, we try to replicate the training 
parameters of the original paper: we use SGD with batch size of 128, during 200 
epochs, with fixed Nesterov momentum 0.9 and learning rate of 0.1 multiplied 
by 0.2 at epochs 60, 120 and 160. The kernel parameters are initialized according 
to the He normal initialization [14]. 
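
The two kernel initializations used in this section, Xavier uniform for All-CNN and He normal for WRN, differ only in how the scale is derived from the layer's fan-in and fan-out. The sketch below uses the standard formulas; the helper names are hypothetical, the kernel layout (height, width, input channels, output channels) is an assumption, and a plain normal distribution is used where a framework may draw from a truncated one.

```python
import numpy as np

def conv_fans(kernel_shape):
    """Fan-in and fan-out of a conv kernel shaped (height, width, in_channels, out_channels)."""
    kh, kw, c_in, c_out = kernel_shape
    return kh * kw * c_in, kh * kw * c_out

def xavier_uniform(kernel_shape, rng):
    """Sample from U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    fan_in, fan_out = conv_fans(kernel_shape)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=kernel_shape)

def he_normal(kernel_shape, rng):
    """Sample from N(0, std^2) with std = sqrt(2 / fan_in)."""
    fan_in, _ = conv_fans(kernel_shape)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=kernel_shape)

rng = np.random.default_rng(0)
all_cnn_kernel = xavier_uniform((3, 3, 96, 96), rng)   # e.g. a 96C3 layer fed by 96 channels
wrn_kernel = he_normal((3, 3, 160, 160), rng)          # e.g. a 3x3 convolution inside a 160R block
```

The He scheme depends only on the fan-in and is designed for ReLU activations, which suits the BN-ReLU-conv ordering of the residual blocks.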


2.2 Data 


We train the networks described above on both CIFAR-10 and CIFAR-100 [20].
CIFAR-10 contains images of 10 different classes and CIFAR-100 of 100 classes.
Both data sets consist of 60,000 32 x 32 color images split into 50,000 for
training and 10,000 for testing. In all our experiments, the input images are
fed into the network with pixel values in the range [0, 1] and 32-bit
floating-point precision. Every network architecture is trained with three
data augmentation schemes: no augmentation, light augmentation and heavier
augmentation. The light scheme only performs horizontal flips and horizontal
and vertical translations of 10% of the image size, while the heavier scheme
performs a larger range of affine transformations, as well as contrast and
brightness adjustment. We use the same schemes as in [16], where more details
are given in an appendix. It is important to note, though, that the light
scheme is adopted from previous works such as [10,28], while the heavier
scheme was first defined in [16], not with the aim of designing a particularly
successful scheme, but rather a scheme with a large range of transformations.
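
A light augmentation scheme of the kind described above (horizontal flips plus horizontal and vertical translations of up to 10% of the image size) can be sketched in NumPy as follows. The details are assumptions made for illustration: a flip probability of 0.5, integer pixel shifts, and zero filling of the uncovered border; the exact settings are those of [16] and are not reproduced here.

```python
import numpy as np

def light_augment(image, rng, max_shift_frac=0.1):
    """Random horizontal flip plus random translation of an (H, W, C) image."""
    h, w = image.shape[:2]
    out = image[:, ::-1] if rng.random() < 0.5 else image  # flip left-right with probability 0.5
    max_dy = int(round(max_shift_frac * h))
    max_dx = int(round(max_shift_frac * w))
    dy = int(rng.integers(-max_dy, max_dy + 1))  # positive dy shifts the content down
    dx = int(rng.integers(-max_dx, max_dx + 1))  # positive dx shifts the content right
    shifted = np.zeros_like(out)                 # uncovered border is filled with zeros
    src_y, dst_y = slice(max(0, -dy), min(h, h - dy)), slice(max(0, dy), min(h, h + dy))
    src_x, dst_x = slice(max(0, -dx), min(w, w - dx)), slice(max(0, dx), min(w, w + dx))
    shifted[dst_y, dst_x] = out[src_y, src_x]
    return shifted

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3)).astype(np.float32)  # stand-in for a CIFAR image in [0, 1]
augmented = light_augment(image, rng)
```

Each call draws a new random transformation, so applying the function anew in every epoch yields a different variant of the same training image each time.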


2.3 Training and Testing 


We train every model with the original explicit regularization, that is weight 
decay and dropout, as well as with no explicit regularization. Besides, we test 
both models with the three data augmentation schemes: light, heavier and no 
augmentation. The test accuracy we report results from averaging the softmax 
posteriors over 10 random light augmentations. 
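
The test-time procedure, averaging the softmax posteriors over 10 random light augmentations, can be sketched as follows. Everything except the averaging itself is a stand-in: `toy_model` is a hypothetical linear map returning logits, and `random_flip` is a reduced augmentation used only to keep the example short.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tta_predict(model_fn, augment_fn, images, rng, n_aug=10):
    """Average softmax posteriors over n_aug random augmentations of each image."""
    total = None
    for _ in range(n_aug):
        batch = np.stack([augment_fn(img, rng) for img in images])
        probs = softmax(model_fn(batch))
        total = probs if total is None else total + probs
    return total / n_aug

def random_flip(image, rng):
    """Stand-in augmentation: horizontal flip with probability 0.5."""
    return image[:, ::-1] if rng.random() < 0.5 else image

rng = np.random.default_rng(0)
toy_weights = rng.normal(size=(32 * 32 * 3, 10))  # hypothetical "network" parameters

def toy_model(batch):
    """Hypothetical model: flattens each image and returns 10 logits."""
    return batch.reshape(len(batch), -1) @ toy_weights

images = rng.random((4, 32, 32, 3))
posteriors = tta_predict(toy_model, random_flip, images, rng)
predicted = posteriors.argmax(axis=1)
```

Averaging probabilities rather than logits keeps each row of `posteriors` a valid distribution, and the final label is the argmax of the averaged posterior.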

All the experiments are performed with the neural networks API Keras [5] on 
top of TensorFlow [1], on a single NVIDIA GeForce GTX 1080 Ti GPU. 
3 Results 


In this section we present and analyze the performance of the networks trained 
with different data augmentation schemes and with the regularizers on and off. 
We are interested in comparing data augmentation and explicit regularization 
regarding two different aspects: the performance when the training data set is 
reduced to 50% and 10% of the available examples and the performance when 
the architecture is shallower and deeper than the original. The presentation 
of the results in Figs. 1 and 2 aims at enabling an easy comparison between 
the performance of a given network on a particular data set, when it has been 
trained with weight decay and dropout and when it has no explicit regularization 
(red and purple bars, respectively). The figures also allow a comparison of the 
performance between the different levels of regularization (color saturation). 


3.1 Reduced Training Sets 


The performance of All-CNN and WRN trained with only 50% and 10% of the 
available data is presented in Fig. 1. From a quick look at the accuracy bars it 


[Figure 1 panels: All-CNN and WRN on CIFAR-10 and CIFAR-100, trained with 50% and 10% of the data; bars for No Reg. and WD+Dropout under no, light and heavier augmentation.]

Fig. 1. Test performance of the models trained with weight decay and dropout (red) 
and the models trained without explicit regularization (purple) when the amount of 
available training data is reduced. In general, the latter outperform the regularized 
counterparts and the differences become larger as the amount of training data decreases. 
(Color figure online) 
already becomes clear that the models trained without any explicit regularization 
(purple bars) outperform the models trained with weight decay and dropout 
(red bars). This is true for almost all the models trained with heavier data 
augmentation (darkest bars). Only in the case of WRN trained with 50% of 
CIFAR-10, the accuracy of the regularized model is marginally better (<0.001). 
Otherwise, it seems that turning off the explicit regularizers not only does not 
degrade the performance, but it helps achieve even better generalization. 

The differences become even greater as the amount of training examples 
gets smaller, in view of the results of training with only 10% of the data. In 
these cases, the non-regularized models clearly outperform their counterparts. 
We hypothesize that this may occur because the values of the hyperparameters of 
weight decay and dropout, which were tuned to achieve state-of-the-art results 
with 100% of the data in the original publications, are not suitable anymore 
when the training data changes. It may be possible to improve the performance 
of the regularized models by adapting the values of the hyperparameters, but 
that would require a considerable amount of time and effort. On the contrary, it 
seems that the same data augmentation scheme helps generalize even when the 
training data set gets smaller. 

The great implicit regularization effect of data augmentation becomes evident 
by looking at the large performance gap between the light scheme and no data 
augmentation. It seems that just a small set of simple transformations helps 


[Figure 2 panels: shallower, original and deeper All-CNN on CIFAR-10 and CIFAR-100; bars for No Reg. and WD+Dropout under no, light and heavier augmentation.]

Fig. 2. Test performance of the models trained with weight decay and dropout (red) 
and the models trained without explicit regularization (purple) on shallower and larger 
versions of All-CNN. In all the models trained with weight decay and dropout, the 
change of architecture results in a dramatic drop in the performance, compared to the 
models with no explicit regularization. (Color figure online) 
the networks reduce the generalization gap by a large margin. In all cases the 
regularization effect is much larger than the one of weight decay and dropout. 


3.2 Shallower and Deeper Architectures 


Figure 2 shows the accuracy of All-CNN when we increase or reduce the depth 
of the architecture. If no explicit regularization is included (purple bars), we 
observe that the deeper architecture improves the results of the original network 
on both data sets, while the shallower architecture suffers a slight drop in the 
performance. In the case of the models with weight decay and dropout (red bars), 
not only is the performance much worse than their non-regularized counterparts, 
but even the deeper architectures suffer a dramatic performance drop. This seems 
to be another sign that the values of the hyperparameters of weight decay and dropout 
largely depend on the architecture and any modification requires the fine-tuning 
of the regularization parameters. That is not the case of data augmentation, 
which again seems to easily adapt to the new architectures because its potential 
depends mostly on the type of training data. 


4 Discussion and Conclusion 


This work has extended the insights from [16] about the futility of using weight 
decay and dropout for training convolutional neural networks for image object 
recognition, provided enough data augmentation is applied. In particular, we 
have focused on further exploring the advantages of data augmentation over 
explicit regularization, in terms of its adaptability to changes in the network 
architecture and the size of the training set. 

Our results show that explicit regularizers, such as weight decay and dropout, 
cause significant drops in performance when the size of the training set or the 
architecture changes. We believe that this is due to the fact that their hyper- 
parameters are highly fine-tuned to some particular settings and are extremely 
sensitive to variations of the initial conditions. On the contrary, data augmenta- 
tion adapts more naturally to the new conditions because its hyperparameters, 
that is the type of transformations, depend on the type of training data and 
not on the architecture or the amount of available data. For example, a model 
with neither weight decay nor dropout slightly improves its performance when more 
layers are added and therefore the capacity is increased. However, with explicit 
regularization, the performance even decreases. 

These findings contrast with the standard practice in the convolutional net- 
works literature, where the use of weight decay and dropout is almost ubiquitous 
and believed to be necessary for enabling generalization. Furthermore, data aug- 
mentation is sometimes regarded as a hack that should be avoided in order to 
test the potential of a newly proposed architecture. We believe instead that these 
roles should be switched, because in addition to the results presented here, data 
augmentation has a number of other advantages: it increases the robustness of 
the models against input variability without reducing the effective capacity and 
may also enable learning more biologically plausible features [17]. We encourage
future work to shed more light on the benefits of data augmentation and
the handicaps of ubiquitously using explicit regularization, especially in
research projects, by testing new architectures and data sets. 
Abstract. Predicting drug-target interactions (DTIs) is a critical step in new
drug discovery and drug repositioning. Various computational algorithms have
been developed to discover new DTIs, but their prediction accuracy is still
not satisfactory. Most existing computational methods are based on homogeneous
networks or on integrating multiple data sources, without considering the
feature associations between gene and drug data. In this paper, we propose a
deep-learning-based hybrid model, DTI-RCNN, which integrates a long short-term
memory (LSTM) network with a convolutional neural network (CNN) to further
improve DTI prediction accuracy using drug data and gene data. First, we
extract potential semantic information between gene data and drug data via an
LSTM network. We then construct a CNN to extract the loci knowledge in the
LSTM outputs. Finally, a fully connected network is used for prediction. The
comparison of results shows that the proposed model exhibits better
performance. More importantly, DTI-RCNN is stable and efficient in predicting
novel DTIs. Therefore, it should help select candidate DTIs and further
promote the development of drug repositioning. 


Keywords: DTIs · Hybrid model · LSTM · CNN · Drug repositioning 


1 Introduction 


In new drug discovery and drug repositioning, a critical step is the
prediction of drug-target interactions (DTIs). Although biological
experimental techniques have made great progress, the discovery of new DTIs is
still a challenging task [1]. The currently known DTIs account for a very
small proportion of the total DTI data [2], so finding an efficient method for
screening effective new DTIs from a large amount of drug-target data is a very
meaningful task. 


The first two authors should be regarded as Joint First Authors. 
In the past decade, machine learning methods have been applied to the discovery 
of DTIs. The importance of structured knowledge and collective classification for drug- 
target prediction was discussed by Fakhraei et al. [3]. Bleakley and Yamanishi used a 
support vector machine framework to predict DTIs based on a bipartite local model 
(BLM) [4]. Mei et al. further improved this framework by introducing a neighbor-based 
interaction-profile inferring (NII) procedure into BLM (called BLMNII), which can 
extract DTI features from neighbors and predict interactions for new drug or target 
candidates [5]. Laarhoven et al. proposed a Gaussian interaction profile (GIP) kernel to 
represent the interactions between drugs and targets, and they combined RLS with the 
GIP kernel for DTI prediction problems [6, 7]. Wang and Zeng proposed a method 
based on the RBM model that could be used to predict multi-type associations and has 
shown its powerful performance in multi-type DTI prediction [8]. These prediction 
methods mainly focus on exploiting information from homogeneous networks and 
have performed well in some datasets. Recently, a number of computational strategies 
based on deep learning have also been introduced to address the problem. For example, 
Wen et al. extended the RBM to deep learning by creating a DBN called DeepDTIs, 
that can predict interactions from different data sources including chemical structures 
and protein sequence features [8, 9]. Unterthiner et al. combined multi-task learning 
with deep networks, which was applied to good effect on the ChEMBL database [10, 
11]. These methods use a variety of data sources, but the associations between drug and 
gene data were less considered. Xie et al. developed a deep neural network to predict 
new DTIs based on the L1000 database [12] and obtained good performance [13]. 
However, Xie’s model simply combined drug and gene data and did not consider 
the connection between these two features. 

In this study, we propose a deep-learning-based hybrid model, named DTI-RCNN,
that integrates a long short-term memory (LSTM) network with a convolutional
neural network (CNN) to further improve DTI prediction accuracy using drug
and gene data. The main novelty is that we introduce the LSTM network to
capture the relationship between the drug and gene data. The features output
by the LSTM network are then fed into the CNN to extract the knowledge between
different loci. With this hybrid architecture, DTI-RCNN achieves excellent
prediction performance. Furthermore, it provides a practical tool for
predicting unknown DTIs from the L1000 database, offering new insights for
drug discovery, drug repositioning and the understanding of drug action
mechanisms. 


2 Methods 


2.1 Data Source 


The Library of Integrated Network-based Cellular Signatures (LINCS) project is
a Common Fund program administered by the U.S. National Institutes of Health
(NIH). The funds for this project enabled the generation of approximately one
million gene expression profiles using the L1000 technology [14], which
reduces the number of gene expressions that need to be measured from more than
20,000 to 978. This database provides a unified and extensive source of
transcriptome data. For the work described in this paper, we collected drug
perturbation and gene knockout perturbation data from the following seven cell
lines: A375, A549, HA1E, HCC515, HEPG2, PC3 and VCAP. 

The DrugBank database is a comprehensive drug data source that records
chemical, pharmacological and pharmaceutical features [15]. To obtain the
complete DTI data, the PubChem ID was used as the drug identifier. 


2.2 Construction of Positive and Negative Samples 


In this study, we modeled the DTI prediction problem as a binary
classification task and applied DTI-RCNN to it. From the L1000 and DrugBank
databases we obtained drug perturbation data, gene knockout trials and DTI
pairs for the seven cell lines listed above. Some of the gene knockout trials
correspond to target proteins while others do not. We treat each drug-target
interaction pair as a positive sample and each combination of drug data with
non-target protein gene data as a negative sample. To prevent an excess of
negative samples from biasing the trained model toward negative predictions,
we sampled the negative samples uniformly so that the ratio of positive to
negative samples is 1:2. 
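
The construction of a 1:2 positive-to-negative sample set can be sketched as follows. This is an illustration under assumptions: "uniformly" is interpreted here as uniform random sampling without replacement, and the function name, identifiers and seed are hypothetical.

```python
import random

def build_samples(positive_pairs, candidate_negative_pairs, neg_per_pos=2, seed=0):
    """Label positives with 1 and a uniformly drawn subset of negatives with 0."""
    rng = random.Random(seed)
    n_neg = min(len(candidate_negative_pairs), neg_per_pos * len(positive_pairs))
    negatives = rng.sample(candidate_negative_pairs, n_neg)  # without replacement
    samples = [(pair, 1) for pair in positive_pairs] + [(pair, 0) for pair in negatives]
    rng.shuffle(samples)
    return samples

# Hypothetical (drug, gene) identifiers, used only for illustration.
positives = [("drugA", "gene1"), ("drugB", "gene2")]
candidates = [("drugA", "gene%d" % i) for i in range(10, 20)]
samples = build_samples(positives, candidates)
```

Fixing the seed makes the drawn negative set reproducible, which matters when several models are compared on the same samples.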

As mentioned above, the dimension of a gene expression profile obtained with
the L1000 technology is 978, and a sample includes both a drug perturbation
profile and a gene knockout trial. However, unlike other methods, we do not
directly concatenate the drug data and the gene knockout trial into one
vector. Instead, we stack the gene perturbation data and the drug data, in
that order, to form a 2 x 978 matrix, so that the LSTM network can fully learn
the semantic correlation between the gene knockout trial and the drug data.
The feature matrix for each input sample is denoted as follows: 


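\[
x_i = \begin{bmatrix} g_i^1 & g_i^2 & \cdots & g_i^n \\ d_i^1 & d_i^2 & \cdots & d_i^n \end{bmatrix} \tag{1}
\]

where $x_i$ denotes the $i$-th sample, $g_i^j$ and $d_i^j$ represent the $j$-th
gene feature and the $j$-th drug feature of the $i$-th sample, respectively,
and $n$ is the dimension of the drug and gene features.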
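
Building the feature matrix of Eq. (1) amounts to stacking the two profiles row-wise, gene first and drug second. A minimal sketch, assuming each profile is a 978-dimensional vector; the function name and the random profiles are hypothetical:

```python
import numpy as np

N_FEATURES = 978  # dimension of an L1000 expression profile

def make_feature_matrix(gene_profile, drug_profile):
    """Stack a gene knockout profile and a drug perturbation profile into a 2 x n matrix."""
    g = np.asarray(gene_profile, dtype=np.float32)
    d = np.asarray(drug_profile, dtype=np.float32)
    if g.shape != (N_FEATURES,) or d.shape != (N_FEATURES,):
        raise ValueError("both profiles must be vectors of length %d" % N_FEATURES)
    return np.stack([g, d], axis=0)  # row 0: gene features, row 1: drug features

rng = np.random.default_rng(0)
gene = rng.normal(size=N_FEATURES)   # random stand-in for a gene knockout profile
drug = rng.normal(size=N_FEATURES)   # random stand-in for a drug perturbation profile
x = make_feature_matrix(gene, drug)
```

Keeping the two profiles as separate rows, rather than concatenating them into one vector of length 1956, is what lets a recurrent layer treat the sample as a sequence of two steps.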


2.3 Hybrid Model Construction 


In this paper, we developed a hybrid model DTI-RCNN, integrating a LSTM network 
and a CNN to solve the DTIs prediction problem. Figure 1 shows the architecture of 
our DTI-RCNN, which is a two-part network structure. The first part is a simplified 
version of the LSTM network, and the second part is a CNN. 

When the positive and negative samples were generated, the input feature of each 
sample collected was a gene-drug pair, which is a 2 x 978 matrix. To deal with the 
semantic relationship between genes and drug characteristics, the recurrent units in the 
recurrent neural network (RNN) were replaced by the LSTM network, allowing the 
gene and drug information to fully fuse. In the LSTM network, the hidden layer 
contains multiple memory cells. Since the units of hidden layer also play a role in 
encoding features, the number of units (N) is generally smaller than the dimensions of 


DTI-RCNN: New Efficient Hybrid Neural Network Model 107 


the input features. A gene-drug pair is input into the LSTM network as a short sequence 
of two, so the gene feature is processed first, followed by the drug feature. Note that
when the gene and drug features enter the LSTM network, they are multiplied by
the same set of parameter matrices; that is, their parameters are shared. The output of
the gene feature after the LSTM step is fed back into the network together with the
drug feature. This operation is what allows us to analyze the semantic information
between gene and drug features. Finally, the gene and the drug features each yield
one output vector after the LSTM process. We then combine the two vectors
into a 2 x N matrix and use it as input to the CNN.

Fig. 1. DTI-RCNN architecture.


2.4 Learning Semantic Information via a LSTM Network 


Recurrent neural networks (RNNs) are a variant of neural networks in which units are
connected along a sequence [16]. RNNs were proposed to process sequence information.
Specifically, the network memorizes previous information and applies it to the
calculation of the current output; that is, the units within the hidden layer are
connected across time steps, and the input of the hidden layer includes both the output
of the input layer and the output of the hidden layer at the previous time step.
Considering these characteristics, we used an RNN to learn the relationship
between drug and gene data.

In standard RNNs, the recurrent hidden module has only a very simple structure,
namely a non-linear activation function. Given a sequential sample $x_i = (g_i, d_i)$, an RNN
updates its hidden state $s_t$ by

$$s_1 = f(U g_i), \qquad s_2 = f(U d_i + W s_1) \qquad (2)$$

where $U$ is the hidden-layer parameter matrix of the current input feature, $W$ is the
parameter matrix of the hidden-layer output $s_1$ from the previous time step, and $f(\cdot)$ is
generally a non-linear activation function, such as tanh or ReLU.

However, Eq. (2) shows that the fusion of gene and drug data is achieved only
through simple matrix multiplication and addition, which is similar to
the calculation after a simple concatenation, and therefore cannot thoroughly learn the
correlation between the drug and gene data. An LSTM network, first proposed in Ref.
[17], can fully integrate the prior information and the current input data in the
hidden-layer module because of its special hidden-layer structure. Unlike a single neural
network layer, the hidden layer of an LSTM network has four network layers that interact
in a very special way. It also introduces a new hidden-layer state named the cell
state $c_t$. The LSTM network updates information in the cell state to fuse
information from different time steps. The operations are summarized in [17], and are


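$$f_2 = \sigma(W_f \cdot [s_1, d_i] + b_f) \qquad (3)$$

$$i_2 = \sigma(W_i \cdot [s_1, d_i] + b_i), \qquad \tilde{c}_2 = \tanh(W_c \cdot [s_1, d_i] + b_c) \qquad (4)$$

$$c_2 = f_2 * c_1 + i_2 * \tilde{c}_2 \qquad (5)$$

$$o_2 = \sigma(W_o \cdot [s_1, d_i] + b_o), \qquad s_2 = o_2 * \tanh(c_2) \qquad (6)$$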


where $\sigma(\cdot)$ denotes a sigmoid function with an output between 0 and 1, $b$ is a bias term,
$d_i$ is the drug feature of the $i$-th sample, $s_1$ is the state of the hidden layer at the
previous time step, whose only input is the gene feature $g_i$, and $W$ is the parameter
matrix applied to $s_1$ and $d_i$.

In the LSTM network calculation process, all parameter matrices are shared
regardless of whether the input is a gene feature or a drug feature.
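
The two-step computation in Eqs. (3)-(6) can be sketched as follows. This is a minimal single-layer NumPy illustration, not the authors' implementation: the zero initial states, the random weights, and the names `lstm_step` and `encode_pair` are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, s_prev, c_prev, params):
    """One LSTM step; the same `params` serve the gene and the drug input."""
    z = np.concatenate([s_prev, x])                       # [s_{t-1}, x_t]
    f = sigmoid(params["W_f"] @ z + params["b_f"])        # forget gate, Eq. (3)
    i = sigmoid(params["W_i"] @ z + params["b_i"])        # input gate, Eq. (4)
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])  # candidate state, Eq. (4)
    c = f * c_prev + i * c_tilde                          # cell state, Eq. (5)
    o = sigmoid(params["W_o"] @ z + params["b_o"])        # output gate, Eq. (6)
    return o * np.tanh(c), c


def encode_pair(pair, params, n_units):
    """Run the gene row, then the drug row, and stack both hidden states."""
    s = np.zeros(n_units)  # assumed zero initial hidden state
    c = np.zeros(n_units)  # assumed zero initial cell state
    outputs = []
    for x in pair:         # row 0 = gene feature, row 1 = drug feature
        s, c = lstm_step(x, s, c, params)
        outputs.append(s)
    return np.stack(outputs)  # 2 x N context matrix passed on to the CNN


rng = np.random.default_rng(0)
n_in, n_units = 978, 400
params = {}
for gate in "fico":
    params[f"W_{gate}"] = rng.normal(scale=0.01, size=(n_units, n_units + n_in))
    params[f"b_{gate}"] = np.zeros(n_units)

context = encode_pair(rng.normal(size=(2, n_in)), params, n_units)
print(context.shape)  # (2, 400)
```

Because `params` is created once and reused in both loop iterations, the gene and drug rows pass through identical weights, which is the parameter sharing described above.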


2.5 Extracting Loci Information Through a CNN 


A CNN is a deep network structure that has been widely used in the fields of computer 
vision, speech recognition, text processing and other artificial intelligence processes. In 
recent years, it has also been used in drug-drug interactions prediction tasks [18]. The 
purpose of using a CNN is to fuse the same locus features. 

For the context feature generated by a LSTM network, we designed a convolutional 
layer and a pooling layer according to the dimension of the matrix. In the convolutional 
layer, we designed multiple convolution kernels as encoders to fully extract the infor- 
mation of the features in multiple perspectives. The convolution process plays a role in 
re-encoding that can reduce the error caused by the redundant information and can 
enhance the effect of effective information. As mentioned above, the context feature 
output after passing through the LSTM network is a 2 x N matrix. Based on this, we
design convolution kernels of size 2 x L as the encoders, with L greater
than 2 and less than N. Each convolution kernel thus produces an
(N - L + 1) x 1 vector. The value of each cell $y_k$ in the vector is calculated as follows:

$$y_k = \sigma\Big(\sum_{i=1}^{L} \sum_{j=1}^{2} L_{i,j}\, s_{k+i-1,j}\Big) \qquad (7)$$

where $1 \le k \le N - L + 1$. In this paper, we set up M different convolution kernels, so
the result of the convolution is a matrix of size (N - L + 1) x M.

The convolutional layer generally is followed by the pooling operation. CNNs in 
computer vision generally use a max-pooling layer to guarantee the translation 
invariance of the image. Instead, we use mean-pooling operation to fuse features 
extracted from the convolutional layer in the pooling layer. 
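
A rough sketch of the convolution and mean-pooling step is given below. It assumes that pooling averages over all kernel positions and it omits the activation function; both choices, and the name `conv_mean_pool`, are illustrative assumptions rather than details confirmed by the text.

```python
import numpy as np


def conv_mean_pool(context, kernels):
    """context: 2 x N LSTM output; kernels: M x 2 x L weights.

    Returns the (N - L + 1) x M feature map and its mean over positions.
    """
    _, n = context.shape
    m, _, width = kernels.shape
    feature_map = np.empty((n - width + 1, m))
    for k in range(n - width + 1):
        window = context[:, k:k + width]  # gene and drug values at the same loci
        # Sum of element-wise products of every kernel with the 2 x L window.
        feature_map[k] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return feature_map, feature_map.mean(axis=0)  # mean-pooling (assumed global)


rng = np.random.default_rng(0)
context = rng.normal(size=(2, 400))       # N = 400 hidden units
kernels = rng.normal(size=(300, 2, 30))   # M = 300 kernels of size 2 x 30
feature_map, pooled = conv_mean_pool(context, kernels)
print(feature_map.shape, pooled.shape)    # (371, 300) (300,)
```

Each 2 x L window covers the gene row and the drug row at the same positions, so every output cell mixes the two rows locus by locus.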


2.6 Assessment of the Model Performance 


For binary classification tasks, the indicators used to evaluate the performance of the 
model mainly include AUC and Precision, which are also adopted in this paper. 
AUC is the area under the receiver operating characteristic (ROC) curve. It can well 
measure the overall performance of the model. The higher the AUC value, the better 
the classification performance of the model. 
Unlike AUC, Precision focuses on how accurate the model's positive predictions are,
i.e., the fraction of samples predicted as positive that are truly positive.
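
Both indicators can be computed directly from labels and predicted scores. The sketch below uses the pairwise-ranking form of AUC and a decision threshold of 0.5 for Precision; the threshold and the toy data are assumptions, since the text does not state them.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision(labels, scores, threshold=0.5):
    """True positives divided by all samples predicted as positive."""
    predicted_pos = [y for y, s in zip(labels, scores) if s >= threshold]
    return sum(predicted_pos) / len(predicted_pos) if predicted_pos else 0.0


# Toy data with the 1:2 positive-to-negative ratio used in the paper.
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]
print(auc(labels, scores), precision(labels, scores))
```

In this toy case every positive outranks every negative, so the AUC is 1.0, while one negative crosses the threshold and Precision is 2/3.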


3 Results 


We sampled the positive and negative samples from the seven cell lines uniformly at a 
ratio of 1:2, and placed them in the model for training and testing. The performance of 
the model under different parameters was mainly discussed, and the best model 
parameters were obtained in each experiment according to the tenfold cross-validation 
method. Finally, the model with the best performance after training was used for DTIs 
prediction. 


3.1 The Impact of Hyper Parameters on Model Performance 


Here, we discuss the effects of several hyper-parameters on the performance of the 
model. In order to find high-performance model parameters, we designed multiple sets 
of different experiments for each parameter to verify the prediction results. For all 
experimental results reported in Figs. 2 and 3 we used the same network structure 
summarized in Table 1 except for the number of neurons in the LSTM hidden layer and 
the size of the convolution kernel. 

The LSTM hidden layer can extract association information between
gene and drug data. In addition, it can encode gene and drug features. Considering that
the gene and drug features put into the model are each represented as a vector of
length 978, and that the number of units in the hidden layer is generally smaller than the
number of input features, we designed seven different numbers of LSTM hidden-layer
units to examine the effect of the number of LSTM hidden-layer units
on model performance. In this group of experiments, we set the size of the convolution
kernel to 30. The experimental results are shown in Fig. 2.


Table 1. Parameter settings for hybrid model

Parameters                    | Range
LSTM neurons                  | [100, 200, 300, 400, 500, 600, 700]
Number of LSTM layers         | 2
Convolution kernel size       | [5, 10, 15, 20, 25, 30, 35]
Number of convolution kernels | 300
Fully connected neurons       | 10
Epoch                         | 80
Batch size                    | 64
Optimizer                     | Adam
Learning rate                 | 0.001

As shown in Fig. 2, when the number of LSTM hidden-layer units is equal to 400,
DTI-RCNN achieves the best classification performance in most cell lines. For most
cell lines, the model's classification performance improves as the number
of neurons increases, but once the number exceeds a certain threshold, the classification
performance gradually degrades. We speculate that a larger number of neurons
initially lets the model better learn the correlation information between gene and
drug features. However, when the number of neurons is too large, the LSTM model
cannot extract the high-dimensional features of gene and drug data, and too much
redundant information blurs the association between them. When the number of neurons
is equal to 100, DTI-RCNN in some cell lines can also learn higher-dimensional
correlation information and feature representations.


Fig. 2. Impact of the number of LSTM hidden-layer units (panels: AUC and Precision). The
abscissa is the number of LSTM hidden-layer units, which is set in the range [100, 200, 300,
400, 500, 600, 700].


Since different sizes of convolution kernels can learn different feature representations,
we tested the model performance with multiple 2 x k convolution kernels. Considering
that the feature dimension of the LSTM network output is above 100, we set
the initial value of k to be relatively large, i.e., equal to 5. To obtain
more suitable parameters, we then gradually increased the size of the convolution kernel
and carried out experiments for k in the range [5, 35]. The number of LSTM hidden-layer
units is 400 in these experiments. The effect of different k values on model performance
is shown in Fig. 3. We can see that the convolution kernel size influences the model
performance. When k is equal to 30, DTI-RCNN achieves the best classification
results in four cell lines (A375, A549, HEPG2, and PC3).

For cell lines HA1E and VCAP, the model achieved the maximum AUC and
Precision when k is equal to 25, and for cell line HCC515 the best classification is
obtained when k is equal to 20.


Fig. 3. Impact of the convolution kernel size (panels: AUC and Precision). The abscissa is
the size of the convolution kernel, which is set in the range [5, 10, 15, 20, 25, 30, 35].


The classification ability of the hybrid model improves with
increasing k, but after k exceeds a certain threshold, the performance of the model
starts to degrade. In general, when k is between 20 and 30, the convolutional
network learns both the global and the local features of the LSTM output
well. When k is below this range, the amount of feature information
extracted by the convolutional network is insufficient; when it is above this range,
the convolutional network focuses on learning the high-dimensional global information
while ignoring the information at the same locus in the gene and the
drug data. This leads to a decrease in the classification performance of the model.


3.2 Comparison with Other Models 


Based on the above experimental results, we have found a set of parameters that exhibit
relatively good classification performance. These parameters are listed in Table 1.
According to Figs. 2 and 3, we set the number of LSTM hidden-layer units of the hybrid
model to 400 and the convolution kernel size to 30.

In addition, we compared DTI-RCNN with other deep learning methods, including 
DNN and RNN. The prediction results of the three methods are shown in Table 2. 

From Table 2, the AUC and Precision indicators of the simple RNN model for the
seven cell lines are better than those of the DNN, indicating that the RNN can
learn the potential relationship between gene and drug data well. The classification
performance of DTI-RCNN is better than that of the RNN, indicating that the CNN can indeed
learn the locus information between gene and drug features. The results show that the
proposed DTI-RCNN is superior to the other deep learning models.


Table 2. Comparison of prediction results of three deep learning algorithms (DTI-RCNN is
the algorithm proposed in this paper).

Cell line | Metric    | DNN             | RNN             | DTI-RCNN
A375      | AUC       | 0.8892 ± 0.015  | 0.9329 ± 0.0165 | 0.9429 ± 0.0076
A375      | Precision | 0.8036 ± 0.0164 | 0.8775 ± 0.0066 | 0.9377 ± 0.0145
A549      | AUC       | 0.891 ± 0.01    | 0.9202 ± 0.0134 | 0.9371 ± 0.0176
A549      | Precision | 0.8339 ± 0.0166 | 0.9168 ± 0.0068 | 0.9261 ± 0.0098
HA1E      | AUC       | 0.8817 ± 0.0203 | 0.9116 ± 0.0181 | 0.9358 ± 0.0149
HA1E      | Precision | 0.8714 ± 0.0105 | 0.9042 ± 0.007  | 0.936 ± 0.0095
HCC515    | AUC       | 0.8812 ± 0.0101 | 0.9433 ± 0.0138 | 0.9613 ± 0.0163
HCC515    | Precision | 0.8093 ± 0.0179 | 0.9325 ± 0.0192 | 0.9515 ± 0.0128
HEPG2     | AUC       | 0.8699 ± 0.0185 | 0.9091 ± 0.0185 | 0.9249 ± 0.0198
HEPG2     | Precision | 0.8405 ± 0.0106 | 0.9065 ± 0.0026 | 0.9118 ± 0.0076
PC3       | AUC       | 0.9097 ± 0.0112 | 0.9326 ± 0.0175 | 0.968 ± 0.0117
PC3       | Precision | 0.846 ± 0.0127  | 0.9248 ± 0.0135 | 0.9522 ± 0.017
VCAP      | AUC       | 0.9061 ± 0.0061 | 0.9328 ± 0.0138 | 0.9537 ± 0.0047
VCAP      | Precision | 0.8977 ± 0.0119 | 0.9055 ± 0.0184 | 0.9163 ± 0.0078


Fig. 4. Overlap between DTIs predicted by hybrid model and DTIs recorded by CTD database.


3.3 Prediction of Novel DTIs 


We used DTI-RCNN to predict novel DTIs. Using the DTIs predicted in the PC3 cell
line as an example, we examined the novel DTIs using the CTD database, a
comprehensive database that includes chemical-gene interactions [19]. We ranked all
novel DTIs by predicted score and computed the overlapping pairs between the novel DTIs
predicted by DTI-RCNN and the interactions from the CTD database. Next, we
counted the number of overlapping pairs in sliding bins of 1,000 consecutive
interactions (Fig. 4). In addition, we used the hypergeometric test to investigate the
statistical significance of the overlap between the predicted DTIs and those in the CTD
database (P value = 1.75 x 10^[exponent lost]). The result indicates that DTI-RCNN can
indeed discover a portion of novel DTIs that are validated by known experiments.
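
The overlap analysis can be sketched with the standard library alone. The step of one interaction between consecutive bins, the helper names, and the toy pairs below are assumptions for illustration; the paper's actual counts are not reproduced here.

```python
from math import comb


def overlap_per_bin(ranked_pairs, known_pairs, bin_size=1000):
    """Count known pairs inside each sliding window of the ranked predictions."""
    hits = [pair in known_pairs for pair in ranked_pairs]
    return [sum(hits[i:i + bin_size]) for i in range(len(hits) - bin_size + 1)]


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken from `population` items,
    of which `successes` are known interactions."""
    total = comb(population, draws)
    tail = sum(
        comb(successes, x) * comb(population - successes, draws - x)
        for x in range(k, min(successes, draws) + 1)
    )
    return tail / total


# Toy example: hypothetical (drug, gene) pairs ranked by predicted score.
ranked = [("drugA", "g1"), ("drugB", "g2"), ("drugC", "g3")]
known = {("drugB", "g2")}
print(overlap_per_bin(ranked, known, bin_size=2))

# Chance of seeing at least 3 known pairs among 5 picks from 50 candidates
# that contain 10 known pairs in total.
print(hypergeom_sf(3, population=50, successes=10, draws=5))
```

A small tail probability means the observed overlap would be unlikely if the top-ranked predictions were drawn at random from the candidate pairs.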


4 Conclusions 


In this work, we proposed a DTI prediction framework, designated DTI-RCNN, which
is based on an RNN-CNN hybrid model, and used the drug perturbation transcriptome
data and gene knockout trails in the L1000 database to train the model. DTI-RCNN can
effectively learn the associated semantic information between gene and drug data, and
can make full use of their locus features for prediction. The results show that the
proposed model's classification performance is superior to that of other deep learning
methods and that it is able to discover more reliable DTIs. The data from multiple
cell lines demonstrate the superiority and robustness of DTI-RCNN. This also suggests
that our hybrid model can effectively integrate gene and drug transcriptome data and
shorten the DTI prediction step within the drug discovery process.


Abstract. Emotion cause detection, which recognizes the cause of an emotion
in microblogs, is a challenging research issue in the field of Natural Language
Processing. In this paper, we propose a hierarchical Convolution Neural Network
(Hier-CNN) for emotion cause detection. Our Hier-CNN model deals with the
feature sparsity problem through a clause-level encoder, and handles the lack of
event-based information through a subtweet-level encoder. In the clause-level
encoder, the representation of a word is augmented with its context. In the
subtweet-level encoder, the event-based features are extracted in terms of
microblogs. Experimental results show that our model outperforms several
strong baselines and achieves state-of-the-art performance.


Keywords: Hierarchical model · Convolution Neural Network · Emotion cause detection


1 Introduction 


Emotions are among the most fundamental aspects of human experience, and emotion
analysis therefore has great value in a wide range of real-life applications. In the research
community of Natural Language Processing (NLP), there are mainly two kinds of 
emotion analyses: emotion classification and emotion cause detection. The former 
focuses on the category of an emotion and the latter works on the cause of an emotion. 
In this paper, we work on the emotion cause detection task of Cheng et al. (2017). 

A microblog focuses on an event, and a clause in a microblog often contains only 
some information about the event, so the extraction of event-based features for a clause 
needs to access the focused event in the microblog. In this paper, we propose a 
hierarchical approach which contains two steps (clause-level and subtweet-level) to 
extract event-based features. Given a Chinese microblog, a clause-level encoder 
combines several neural networks to extract local features in each clause. Then, a 
subtweet-level encoder treats those local features as a sequence and then extracts 
sequence features for each clause through Convolution Neural Networks (CNNs; Kim 
2014). Moreover, because of the feature sparsity problem in our small-scale
experimental data, our clause-level encoder extracts two kinds of local features that
complement each other: salient features from a CNN and weighted features from an
attention network.

The contributions of this paper are summarized as follows:


• We propose a hierarchical model to extract event-based features, which uses a
clause-level encoder to extract rich local features in a clause and then uses a
subtweet-level encoder to extract sequence features of the whole microblog.

• We propose a context-aware attention encoder to address the feature sparsity
problem, which uses context-based representations of words to learn word weights.


2 Related Work 


Due to the increasing attention paid to emotion cause detection recently, a few
emotion cause corpora are available. Most of them are manually annotated, either for
formal texts (Lee et al. 2010; Gui et al. 2016; Xu et al. 2017) or for informal texts (Gui
et al. 2014; Gao et al. 2015; Cheng et al. 2017). Based on these emotion cause corpora,
intensive studies have explored the extraction of effective features for two kinds of
emotion causes: explicit causes, which are expressed with explicit connectives (e.g. "to
cause", "for"), and implicit causes, which are inferred from the given texts. In the
former case, different linguistic rules are proposed to extract linguistic expression
patterns using the context of the current clause (Chen et al. 2010; Xu et al. 2017; Ghazi
et al. 2015). In the latter case, different models that extract event-based features
reflecting the causal relation are examined, such as the convolutional deep memory
network (ConvMS-Memnet; Gui et al. 2017), the Long Short-Term Memory network (LSTM;
Cheng et al. 2017) and so on. Because implicit emotion causes play a dominant role in
Chinese microblogs (Cheng et al. 2017), we focus on event-based feature extraction for
implicit emotion cause detection in this paper.


3 Our Approach 


3.1 Task Definition 


In this paper, we use the emotion cause corpus provided by Cheng et al. (2017) as our
experimental data, in which emotion causes in Chinese microblogs are manually
labeled (hereafter the Cheng emotion cause corpus). Moreover, to better explain our work,
we adopt the Twitter terminology used in Cheng et al. (2017).

In the Cheng emotion cause corpus, a tweet can be considered as a sequence of
subtweets ordered by their publication time. For example, in Fig. 1 there are five subtweets
sequentially published by five users (I'm Jay, Desdis Yun, I'm eggette, Little Koala,
and the owner of the tweet). Furthermore, given an emotion keyword in
a subtweet, Cheng et al. (2017) found that the corresponding emotion causes are usually
located either in the current subtweet or in the original subtweet. Therefore, there are two
emotion cause detection tasks: current-subtweet-based emotion cause detection and
original-subtweet-based emotion cause detection. The experimental results of Cheng
et al. (2017) showed that the current-subtweet-based emotion cause detection task is
more challenging, and thus we focus on this emotion cause detection task in this paper.

Chinese: (original Chinese text not recoverable)

English Translation: Oh yeah... //@Little Koala: Salary is increased, salary is
increased--happying //@I'm eggette: & //@Desdis Yun: It is awkward for me who was
resigned just now!!!!

[original subtweet] @I'm Jay: Nothing even exam fails can compare my message hahaha~~


Fig. 1. An example of a tweet.


In order to extract features from the perspective of the whole subtweet, an instance
is a pair (X, Y), where the input X consists of an emotion keyword (EmoKW) and a
sequence of clauses in a subtweet, and the output Y is a sequence of binary labels that
indicate the causal relation between each clause and the emotion keyword. For example, in
Fig. 1 there are two clauses in the current subtweet for "awkward" (the emotion keyword):
"It is", and "for me who was resigned just now". The corresponding labels for the two
clauses are '0' and '1'. Furthermore, in order to provide complementary information to a
clause, each clause in the input X is paired with a context (i.e. the text between
EmoKW and the current clause). Finally, the input text of an instance includes an
EmoKW, a sequence of clauses (ClauseSeq) and a sequence of contexts (ContextSeq).
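
One way to picture an instance is as a small record type. The class below is a hypothetical container written for illustration; its field names mirror the terms in the text, and the context strings are left empty because the text does not spell them out for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    emo_kw: str             # the emotion keyword (EmoKW)
    clause_seq: List[str]   # ClauseSeq: clauses of the current subtweet
    context_seq: List[str]  # ContextSeq: one context per clause
    labels: List[int]       # Y: 1 if the clause is a cause of emo_kw, else 0

    def __post_init__(self):
        if not (len(self.clause_seq) == len(self.context_seq) == len(self.labels)):
            raise ValueError("clauses, contexts and labels must align one-to-one")
        if any(label not in (0, 1) for label in self.labels):
            raise ValueError("labels must be binary")


# The example from Fig. 1; contexts are placeholders (assumption).
example = Instance(
    emo_kw="awkward",
    clause_seq=["It is", "for me who was resigned just now"],
    context_seq=["", ""],
    labels=[0, 1],
)
print(len(example.clause_seq), example.labels)
```

Treating the whole subtweet as one instance, rather than one clause at a time, is what allows the label sequence to be predicted with access to all clauses.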


3.2 Overview 


Our emotion cause detection approach is based on a neural network which mainly
includes two components: an encoder which extracts a feature representation and a
decoder which assigns a label to each clause according to the representation. As shown
in Fig. 2, a hierarchical CNN encoder is applied to each input sequence (ClauseSeq or
ContextSeq) and generates a sequence of hierarchical features ($h_{hier\_Clause}$ or
$h_{hier\_Context}$). Then, the final representation of each clause is the concatenation of the
feature of EmoKW ($h_{EmoKW}$) and the two hierarchical features taken separately from
$h_{hier\_Clause}$ and $h_{hier\_Context}$. In the classification decoder, a linear layer takes the
final representation as the input, and generates a label with a softmax function.

To better explain the hierarchical CNN encoder in the following section, we assume
the input sequence is the sequence of clauses ClauseSeq = $(C_1, \ldots, C_T)$, where $C_i$ is the
$i$-th clause. As shown in Fig. 2, there are two levels of sub-encoders in the hierarchical
CNN: a clause-level encoder which extracts local features ($h_{local\_Context}$ or $h_{local\_Clause}$)
for $C_i$ based on the words in the clause, and a subtweet-level encoder which extracts the
hierarchical feature ($h_{hier\_Context}$ or $h_{hier\_Clause}$) for $C_i$ based on all local features in the
subtweet. Each sub-encoder is a combination of several encoder layers. Given an input
sequence $X$, an encoder layer yields a middle representation $h$ through Eq. 1.

$$h = \mathrm{encoder}(X) \qquad (1)$$


Fig. 2. Overview of our hierarchical emotion cause detection model.

Fig. 3. Illustration of our hierarchical CNN encoder with the clause-level encoder and
subtweet-level encoder. G is the Gated Linear Unit.
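
The classification decoder described in the overview can be sketched as a concatenation followed by a linear layer and a softmax. The feature sizes, random weights, and the function name `decode_clause` are hypothetical values chosen for the example.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def decode_clause(h_emokw, h_hier_clause, h_hier_context, W, b):
    """Concatenate the three features and score the two labels '0' and '1'."""
    h_final = np.concatenate([h_emokw, h_hier_clause, h_hier_context])
    probs = softmax(W @ h_final + b)
    return int(probs.argmax()), probs


rng = np.random.default_rng(0)
d_kw, d_hier = 20, 128                     # assumed feature sizes
W = rng.normal(scale=0.1, size=(2, d_kw + 2 * d_hier))
b = np.zeros(2)
label, probs = decode_clause(
    rng.normal(size=d_kw), rng.normal(size=d_hier), rng.normal(size=d_hier), W, b
)
print(label, probs)
```

The decoder is applied once per clause, so a subtweet with T clauses yields a sequence of T binary labels.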


3.3 The Clause-Level Encoder 


As shown in Fig. 3, the clause-level encoder sequentially uses different kinds of
encoder layers to extract two local features for $C_i$ ($i = 1, \ldots, T$). In order to alleviate the
feature sparsity problem, a CNN is used to extract abstract features over the focused
clause. In the clause-level CNN, convolutional filters are used to extract high-level
features from the sequence of words in $C_i$. Then, to further handle the
feature sparsity problem, two methods are used to extract the two local features for $C_i$: a
max-pooling layer with the rectified linear unit activation function (ReLU; Glorot et al.
2011) to obtain a local salient feature, and a context-aware attention network which
learns the weights of words to obtain a local weighted feature.

In the context-aware attention network, a Gated Linear Unit (Dauphin et al. 2017) is
used to generate a representation of the context of each word and produce a
context-based representation for the word, and then an attention layer (Ma et al. 2017) is
applied to obtain a weighted feature for $C_i$. In this attention layer, the weight of the $j$-th
word $w_j$ ($j = 1, \ldots, N$) in $C_i$ is first obtained through Eq. 2, where $h_{w_j}$ is the representation
of word $w_j$, $h_{EmoKW}$ is the representation of EmoKW, $[\,;\,]$ is the concatenation between
matrices, and $W_a$ and $v_a$ are the weight matrices. Secondly, the weights are normalized to
construct a probability distribution over the words (see Eq. 3). Lastly, the local
weighted feature of $C_i$ (i.e. $h_{att}$) is a weighted summation over the representations of all
words in $C_i$ (see Eq. 4).

$$e_j = v_a^{\top} \tanh(W_a [h_{EmoKW}; h_{w_j}]) \qquad (2)$$

$$\alpha_j = \frac{\exp(e_j)}{\sum_{k=1}^{N} \exp(e_k)} \qquad (3)$$

$$h_{att} = \sum_{j=1}^{N} \alpha_j h_{w_j} \qquad (4)$$
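
The attention layer in Eqs. 2-4 can be illustrated with a short NumPy sketch. All dimensions and random weights are assumptions; in the model the word representations would come from the Gated Linear Unit rather than from a random generator.

```python
import numpy as np


def attention_feature(h_words, h_emokw, W_a, v_a):
    """h_words: N x d_w word representations; h_emokw: d_k keyword vector."""
    n_words = h_words.shape[0]
    keyword = np.tile(h_emokw, (n_words, 1))
    concat = np.hstack([keyword, h_words])  # [h_EmoKW; h_wj] for every word j
    e = np.tanh(concat @ W_a.T) @ v_a       # unnormalized weights, Eq. 2
    e = e - e.max()                         # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()     # distribution over words, Eq. 3
    h_att = alpha @ h_words                 # weighted sum of word vectors, Eq. 4
    return alpha, h_att


rng = np.random.default_rng(0)
n, d_w, d_k, d_a = 6, 128, 20, 32           # assumed sizes
h_words = rng.normal(size=(n, d_w))
h_emokw = rng.normal(size=d_k)
W_a = rng.normal(scale=0.1, size=(d_a, d_k + d_w))
v_a = rng.normal(scale=0.1, size=d_a)

alpha, h_att = attention_feature(h_words, h_emokw, W_a, v_a)
print(alpha.round(3), h_att.shape)
```

Because the keyword vector is concatenated with every word before scoring, the same clause can receive different word weights for different emotion keywords.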


3.4 The Subtweet-Level Encoder 


Based on all local features in a subtweet, which are either local salient features or local
weighted features, the subtweet-level encoder uses two CNNs to extract a hierarchical
feature. Firstly, the local salient features (or the local weighted features) are ordered
into a sequence according to their corresponding clauses, and then subtweet-level
CNN_1 with ReLU is used to extract hierarchical salient features (or hierarchical
weighted features) over the sequence of local features. Secondly, a clause is represented
by a set of features: a local salient feature, a local weighted feature, a hierarchical
salient feature, and a hierarchical weighted feature. The sets of features are ordered into
a sequence according to their corresponding clauses, and then subtweet-level CNN_2
with ReLU and a max-pooling layer is used to extract the final features ($h_{hier\_C}$ in
Fig. 3).


4 Experiments 


4.1 Experimental Setup 


Datasets and Metrics. As mentioned in Sect. 3.1, the Cheng emotion cause corpus is
used in our experiments, which contains approximately 4,300 instances and 12,600
clauses. We use 5-fold cross-validation to evaluate all the methods. Because a subtweet
often contains several emotion keywords, the instances built from the same subtweet
overlap in their input texts. Therefore, when creating the folds, we ensure that
instances from the same subtweet are not shared between the folds. This is important
because having the same subtweets in both the training and the test sets could make a
model appear to perform better than it actually does. Similar to previous work (Cheng
et al. 2017; Gui et al. 2017), only the precision, recall and F1-score of label '1' are
reported as evaluation metrics.

Model Settings and Training Details. The dimension of the word vectors in our model is
20; the kernel widths of the clause-level CNN and subtweet-level CNN_2 are 3, and the
kernel numbers are 128. The kernel widths of subtweet-level CNN_1 are 1 and 4, and the
kernel numbers are both 64. Dropout is set to 0.5 and is only applied to the final
representation. The Adam optimizer (Kingma and Ba 2015) is used to optimize the
parameters; the learning rate is 0.001, the weight decay is 0.0001, and the batch size is
20. All the parameters are initialized with Xavier initialization (Glorot and Bengio
2010).
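
The fold construction described under Datasets and Metrics, where instances from one subtweet never straddle two folds, can be sketched as follows. The greedy balancing strategy and the toy subtweet identifiers are assumptions; the text states only the constraint, not how the folds were built.

```python
from collections import defaultdict


def grouped_folds(subtweet_ids, n_folds=5):
    """Assign instance indices to folds so that each subtweet stays in one fold."""
    groups = defaultdict(list)
    for index, subtweet_id in enumerate(subtweet_ids):
        groups[subtweet_id].append(index)

    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    for subtweet_id in sorted(groups, key=lambda sid: -len(groups[sid])):
        min(folds, key=len).extend(groups[subtweet_id])
    return folds


# Ten toy instances drawn from six hypothetical subtweets.
subtweet_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "f", "f"]
folds = grouped_folds(subtweet_ids, n_folds=5)
for fold_number, fold in enumerate(folds):
    print(fold_number, sorted(fold))
```

Each fold then serves once as the test set while the remaining four are used for training, so no subtweet text is seen in both roles within a split.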


Baselines. We compare our hierarchical CNN approach (Hier-CNN) with the
following baselines, which use different approaches to encode an instance. CNN
and ConvMS-Memnet use the emotion keyword and the current clause as input, and
LSTM uses the same input as Cheng et al. (2017) (i.e. local text defined in Sect. 2).


• CNN: the CNN-based encoder is applied to obtain the representation of local text.

• LSTM: the emotion cause detection approach proposed by Cheng et al. (2017).

• ConvMS-Memnet: the state-of-the-art emotion cause detection approach
proposed by Gui et al. (2017).


4.2 Method Comparison 


Table 1 shows the performances of different emotion cause detection approaches. From 
Table 1, we observe that our hierarchical CNN approach (Hier-CNN) significantly 
outperforms the three baselines and yields the highest performance. Compared with the 
two state-of-the-art emotion cause detection approaches (LSTM and ConvMS- 
Memnet), our hierarchical CNN encoder chooses a multi-channel structure to sepa- 
rately use three sequences of input words in local text (the emotion keyword, the 
current clause and the context), and uses a hierarchical CNN encoder to effectively 
extract event-based features for the emotion cause detection on Chinese microblogs. 


Table 1. The performances of different methods for the emotion cause detection.

Encoder             | Precision | Recall | F1
CNN                 | 48.2      | 57.2   | 52.3
LSTM                | 51.5      | 63.4   | 56.7
ConvMS-Memnet       | 41.4      | 61.0   | 49.2
MChanCNN            | 54.0      | 62.8   | 58.0
MChanLSTM           | 52.9      | 64.7   | 58.1
MChanLSTM-ATT       | 53.1      | 61.9   | 57.1
MChan ConvMS-Memnet | 54.7      | 47.1   | 50.5
Hier-CNN            | 52.9      | 68.8   | 59.7


4.3 Model Analysis


In this section, we make an in-depth analysis of our hierarchical CNN encoder along
two lines: the multi-channel structure and the components of our hierarchical CNN
encoder.


Multi-channel. We integrate the multi-channel structure with each of the three baseline
encoders (CNN, LSTM and ConvMS-Memnet), and list their performances in Table 1
(MChanCNN, MChanLSTM, and MChan ConvMS-Memnet). When the multi-channel
structure is applied to a baseline encoder, the performance improves: the F1-score
increases by 5.7 points for CNN, 1.4 points for LSTM, and 1.3 points for ConvMS-Memnet.
This indicates that the multi-channel structure can effectively detect the causal relation
between an emotion and an event by separately using the information in the
current clause and the complementary information in the context. Moreover, the slight
improvement for LSTM and ConvMS-Memnet shows that these encoders suffer from the
feature sparsity problem in Chinese microblogs.
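
The reported gains follow directly from the F1 column of Table 1, as the short check below illustrates. The dictionary simply restates the table values for each baseline and its multi-channel variant.

```python
# F1-scores from Table 1: (baseline, multi-channel variant).
f1_scores = {
    "CNN": (52.3, 58.0),
    "LSTM": (56.7, 58.1),
    "ConvMS-Memnet": (49.2, 50.5),
}

# Absolute gain in F1 points, rounded to one decimal place.
gains = {name: round(multi - base, 1) for name, (base, multi) in f1_scores.items()}
print(gains)
```

These are absolute differences in F1 points rather than relative improvements.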


Table 2. The detailed performances of our hierarchical model.

Encoder  | Precision | Recall | F1
Hier-CNN | 52.9      | 68.8   | 59.7
R-HF     | 51.7      | 65.9   | 57.3
R-LF     | 52.8      | 61.3   | 56.5
R-WF     | 52.9      | 67.6   | 59.1


Components. In Table 1, although LSTM significantly outperforms CNN (56.7% vs.
52.3% in F1-score), the performance difference between MChanCNN and
MChanLSTM is rather small (58.0% vs. 58.1% in F1-score). CNN and LSTM have
different advantages in terms of feature extraction: CNN is better at capturing
high-level features, and LSTM is better at capturing sequence features. Moreover, we
observe that applying an attention mechanism to MChanLSTM (MChanLSTM-ATT) does
not improve the performance (58.1% vs. 57.1% in F1-score).

Compared with MChanCNN and MChanLSTM, Hier-CNN achieves the best
performance (59.7% in F1-score). This indicates that the hierarchical CNN encoder can
effectively integrate the clause-level information and subtweet-level information.
Moreover, in terms of the attention mechanism, Hier-CNN significantly outperforms
MChanLSTM-ATT (59.7% vs. 57.1% in F1-score). This indicates that Hier-CNN can
better capture the key information of a clause.

In order to investigate the effect of local salient features (LF), local weighted
features (WF) and hierarchical features (HF), we build another three classifiers listed in
Table 2, where R-HF, R-LF and R-WF are the Hier-CNN variants from which HF, LF and WF
are removed, respectively. As shown in Table 2, if LF is removed, the recall drops
significantly, which directly pulls down the overall performance. Moreover, if WF is
removed, the recall drops slightly. This indicates that combining LF and WF
effectively alleviates the feature sparsity problem. Furthermore, it can be observed
that, after removing HF, the overall performance degrades. This indicates that the
subtweet-level information of a clause can effectively augment event-based features
from local clauses, and thus improve the performance.


5 Conclusion 


In this paper, in order to extract more event-based features for emotion cause detection
on Chinese microblogs, we propose a hierarchical CNN approach, which extracts
rich local features using the clause-level encoder and more event-based features using
the subtweet-level encoder. We show that our hierarchical CNN approach can
effectively utilize information in a subtweet for emotion cause detection.

Abstract. Accurate uncertainty predictions are crucial to assess the
reliability of a model, especially for neural networks. Part of this
uncertainty is the observation noise, which is dynamic in our marine virtual
sensor task. Typically, dynamic noise is not trained directly, but
approximated through terms in the loss function. Unfortunately, this noise loss
function needs to be scaled by a trade-off parameter to achieve accurate
uncertainties. In this paper we propose an upgrade to the existing
architecture, which increases interpretability and introduces a novel direct
training procedure for dynamic noise modelling. To that end, we train
the point prediction model and the noise model separately. We present a
new loss function that requires Monte Carlo runs of the model to directly
train for the uncertainty prediction accuracy. In an experimental
evaluation, we show that in most tested cases the uncertainty prediction
is more accurate than with the manually tuned trade-off parameter. Because
of the architectural changes we are able to analyze the importance of
individual parts of the time series of our prediction.


Keywords: CNN · LSTM · Predictive uncertainty · Time series


1 Introduction 


Recent research proposed the combination of dropout and Monte Carlo (MC)
runs to approximate the predictive uncertainty for regression and
classification tasks [3,4]. Instead of predicting a single point, the model expresses its
uncertainty through intervals. This is particularly useful for tasks that need to
evaluate the prediction in terms of reliability and robustness, e.g. mixing the
measured and predicted uncertainty state to control a robot [13]. We apply
this predictive uncertainty method to the marine virtual sensor task based on
the combined Biodiversity-Ecosystem Functioning across marine and terrestrial
ecosystems (BEFmate) [2] and the Time Series Station Spiekeroog (TSS) [1]
real-world dataset [14]. The goal is to replace a real sensor that failed due to the
harsh environmental conditions in the Wadden Sea, such as the daily tidal forces,
salt water exposure, and occasional storms. This replacement sensor is virtual
and represents a nowcasting task in which current values of different origin are
used as input to predict the current target value. In our case, surrounding
sensors are used to model a missing sensor at the same time step. For comparison,
forecasting tasks predict future target values based on current values, e.g. room
temperature forecasts [16].

Previous work introduces the MarineNet architecture [14], which combines 
convolutional as well as recurrent layers, incorporates input quality information, 
and employs the above-mentioned uncertainty prediction method. It assumes
heteroscedastic, or dynamic uncertainty in the observations, which is reflected by 
varying noise in the data. The original method [3, 14] trains this observation noise 
through approximation by tuning a hyper-parameter that cannot be learned 
directly. Moreover, MarineNet applies a unique time dimensionality reduction 
approach, exPAA, which splits a time series into parts that aggregate different 
amounts of time steps. An importance analysis of these exPAA parts for the 
final prediction is difficult, but could be useful for the prediction. 

In this work, we propose to address the shortcomings of MarineNet with: 


1. an architectural upgrade that allows exPAA parts to be analyzed, and
2. a novel training procedure to directly learn the dynamic observation noise. 


The first contribution is achieved by replacing the last fully connected (dense) 
layer with a convolutional layer followed by averaging over the time series and 
more residual connections. We also adjusted the number of neurons of individual
layers and ultimately require fewer weights to achieve similar performance. The second
contribution is attained by separating prediction and observation noise training. 
We introduce a new loss function for the noise training that directly compares the 
predicted and the actual uncertainty of the model. In an experimental evaluation, 
we achieve equal or better performance with the proposed changes and are able 
to analyze the exPAA parts. 

The paper is structured as follows. Section 2 introduces the MarineNet
architecture with the most relevant methodological concepts. In Sect. 3 we describe our
upgrade to the architecture as well as the new direct training of the observation
noise. These upgrades are evaluated in Sect. 4. Finally, we draw conclusions in
Sect. 5.


2 Original MarineNet 


The MarineNet is a neural network architecture utilizing multiple concepts [14].
A macroarchitectural overview is presented in the upper part of Fig. 1.
Convolutional layers filter the time series to create useful temporal features with
kernel sizes of one and three (conv1 and conv3). These are grouped into four
fire modules from SqueezeNet [10]. Then, the exPAA layer [15] follows, which
creates δ′ parts from δ time steps, whereby the number of time steps per part
decreases over time, depending on a hyper-parameter exponent e. Consequently,
earlier parts aggregate more time steps, while more information is retained in
later steps. Next, the biLSTM layer [6,8] is fed the aggregated time series, which
is then processed by a dense layer. Finally, the sensor output and the dynamic
noise are predicted in final linear regression layers.

The dropout mechanism [5,17], where multiple neurons are deactivated for 
one iteration, is employed before each trainable layer. It acts as a regularizer and 
helps to avoid overfitting. Batch normalization [11] is applied after the activation 
function and if applicable after dropout to further reduce overfitting and to speed 
up convergence. 

Another important part is the implementation of predictive uncertainty via
MC dropout inspired by Gal [3,4]. Predictive uncertainty is the confidence of
our model about its current prediction and consists of two parts. First, the data
uncertainty, which is reflected in the training distribution, e.g. predictions are
unreliable if an unseen sample is on the far end of the training distribution
or the available data is noisy. Second, the model uncertainty, which affects the
internal structure and expression of weights; for example, a model weight may be
greater for one input than for another and thus give it more importance. The predictive
uncertainty can be expressed as an interval around the point prediction. To
create this interval, multiple forward passes of MarineNet are calculated with
different dropout realizations. These MC dropout runs are conducted at test time
and give two outputs, a predictive mean E[y_t] with variance Var[y_t] of m MC
model runs f_i ∈ F:


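$$
\begin{aligned}
E[y_t] &= \frac{1}{m} \sum_{i=1}^{m} f_i(x_t), \\
\mathrm{Var}[y_t] &= \frac{1}{m} \sum_{i=1}^{m} \left( g_i(x_t) + f_i(x_t)^2 \right) - E[y_t]^2.
\end{aligned} \tag{1}
$$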


With a higher number m of MC runs, the approximation stabilizes. The
standard uncertainty interval is given by the square root of the predictive variance
(e.g. Var[y_t]^{1/2} is the 68.27% uncertainty interval).
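
As a minimal sketch of how these two quantities can be combined from m MC dropout runs, assume the per-run point predictions f_i(x_t) and noise predictions g_i(x_t) are already available as plain lists. The predictive mean is the average of the runs, and the predictive variance is the average of g_i + f_i² minus the squared mean. The function name `mc_predictive_stats` and the toy numbers are illustrative assumptions, not part of the original implementation.

```python
import math

def mc_predictive_stats(f_runs, g_runs):
    """Combine m MC dropout runs for a single input x_t.

    f_runs: point predictions f_i(x_t), one per MC run
    g_runs: predicted observation noise g_i(x_t), one per MC run
    Returns (predictive mean, predictive variance).
    """
    m = len(f_runs)
    mean = sum(f_runs) / m
    # Average of noise plus squared prediction over all runs.
    second_moment = sum(g + f * f for f, g in zip(f_runs, g_runs)) / m
    var = second_moment - mean * mean
    return mean, var

# Toy example with m = 4 runs (made-up numbers).
mean, var = mc_predictive_stats([1.0, 1.2, 0.8, 1.0], [0.05, 0.05, 0.05, 0.05])
std_interval = math.sqrt(var)  # half-width of the standard uncertainty interval
```

In practice m would be much larger than four, since the estimate only stabilizes with more runs.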

The observation noise g is modeled dynamically, because in the employed
marine application varying noise is introduced, inter alia, by tides and seasons.
This noise is equal to the inverse of the model's precision and is represented by a
function g(x), which is part of the loss function during training:


$$
L := \alpha \cdot \left( y_t - f(x_t) \right)^2 \cdot \left( g(x_t) + 1 \right) - (1 - \alpha) \cdot \log\left( g(x_t) \right), \tag{2}
$$


with the trade-off variable α ∈ [0, 1] to calibrate the uncertainty scale. Since
the noise is not allowed to be smaller than or equal to zero, softplus is used as
the activation function.
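
The following sketch shows one way to evaluate such a loss for a single sample. It assumes our reading of Eq. 2: the squared error is scaled by (g + 1) and weighted with α, and the logarithm of the noise is subtracted with weight (1 − α). The names `softplus` and `tradeoff_loss` are hypothetical helpers, and no automatic differentiation is included.

```python
import math

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z)); positive for moderate z."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def tradeoff_loss(y, f_x, raw_noise, alpha):
    """Per-sample loss in the style of Eq. 2.

    raw_noise is the pre-activation output of the noise layer;
    softplus maps it to a positive noise value g.
    """
    g = softplus(raw_noise)
    scaled_error = (y - f_x) ** 2 * (g + 1.0)   # minimized by small g
    log_term = math.log(g)                       # -log(g) is minimized by large g
    return alpha * scaled_error - (1.0 - alpha) * log_term
```

For example, `tradeoff_loss(2.0, 1.0, 0.0, 1.0)` ignores the logarithm term entirely, while smaller α values let it pull the noise upwards. For very negative `raw_noise` the softplus underflows to zero in floating point, so a practical implementation would add a small positive offset.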

Lastly, the qDrop layer [14] adapts the dropout chance per input dimension 
after the input layer depending on the current time step and sensor quality. The 
sensor quality results from the number of consecutively imputed values, since the 
imputation quality decreases with the length of the data gap. This has a direct
impact on the uncertainty predictions at test time. For example, when we drop
some of the inputs due to low quality, we increase the uncertainty if the dropped
features are important for the prediction. During the training phase, less reliable
features are automatically dropped more often based on their quality and thus
the network learns to favor trustworthy features to a greater extent.


Fig. 1. The macroarchitectural view of MarineNet (top) and UMarineNet (bottom).


3 MarineNet Upgrade 


We found two shortcomings of MarineNet. First, there is no easy way to analyze
the importance of individual exPAA parts. If this information were available,
the time series aggregation could be adapted to focus on the more crucial time
steps. Second, the scaling of the observation noise greatly depends on the
hand-tuned parameter α in Eq. 2. To address these shortcomings, we updated the
architecture to return an explainable time step impact. Further, we change the
training process of MarineNet to acquire accurate uncertainty predictions
without calibration of α.


3.1 Changes to the Architecture 


The architectural changes to MarineNet are shown in the lower part of Fig. 1.
As a first change, we substitute the only dense layer by a convolutional layer
with a kernel size of one (conv1 layer), followed by an averaging of the outputs
over the steps, but not the neurons. We drew inspiration for this change from
multiple publications [7,10,12], which apply this technique to images instead of
time series data. Instead of returning only the last time step output, the bLSTM
layer now passes on its complete output over all time steps. This was avoided
in MarineNet, because the dense layer would have needed significantly more
weights (number of neurons times exPAA parts). Since the conv1 kernel is not
tied to the length of the input series, the computational cost did not increase
substantially with the complete bLSTM output. Further, the output from this
conv1 layer offers insight into which parts are most important, as only averaging
and linear combinations are employed afterwards.
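
To illustrate why this layout exposes per-part importance, the sketch below applies a kernel-size-one convolution (the same linear map at every time step), averages over time, and applies a final linear output. Because averaging and the output layer are both linear, the prediction decomposes into one additive contribution per part. All names (`conv1`, `part_contributions`) and numbers are made up for illustration, and any activation function or bias in the real network is omitted.

```python
def conv1(sequence, weights, bias):
    """Kernel-size-one convolution: the same linear map at every time step."""
    return [
        [sum(w * x for w, x in zip(row, step)) + b for row, b in zip(weights, bias)]
        for step in sequence
    ]

def part_contributions(conv_out, w_out):
    """Additive share of each part in the final linear output (output bias excluded)."""
    n_parts = len(conv_out)
    return [sum(w * h for w, h in zip(w_out, step)) / n_parts for step in conv_out]

# Toy values: 3 parts, 2 input features per part, 2 conv1 neurons.
lstm_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h = conv1(lstm_out, weights=[[1.0, 2.0], [0.5, -1.0]], bias=[0.0, 0.0])
contrib = part_contributions(h, w_out=[1.0, 1.0])
prediction = sum(contrib)  # same value as applying w_out to the time-averaged features
```

Note that the number of conv1 weights depends only on the feature dimensions, not on the number of parts, which matches the argument about computational cost above.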

Because of the change to the bLSTM output, it is now possible to add more
residual connections [7]. We create compatibility between the outputs by
applying exPAA to acquire the same time resolution and a conv1 layer to adjust
for differences in neuron count. More residual connections are added inside the
fire modules, after the single conv1 layers. Another change is that each residual
connection also uses all compatible residual connections before it. These
kinds of dense residuals were introduced by Zhang et al. [18] and Huang et al. [9].

Through changes to the number of neurons and layer compositions, our
UMarineNet requires 3.04 times fewer weights, which amounts to 376188
compared to previously 1145072 weights. We increased the number of neurons for
the first conv1 layer in fire modules from 48 to 64. In the bLSTM layer, now
192 instead of 512 neurons are employed. The conv1 layer that replaces the
dense layer keeps its 512 neurons, but the weight matrix shrinks because of the
smaller input from the bLSTM layer. All normal dropout layers utilize a 50%
keep chance.


3.2 Automatic Training of Accurate Uncertainty Predictions 


The loss function in Eq. 2 employs two counteracting mechanisms to learn the
model noise: scaling the original error, which is minimized for small noise values, and the
negative logarithm of this noise, which is minimized for large values. Depending on
the scaling of the target variable and underlying processes, the negative
logarithm can be a poor choice to train the noise. The trade-off parameter α partly
mitigates this effect, but needs to be tuned separately. We are not optimizing
directly for the uncertainty, since it would require the MC prediction during
training, which is computationally costly at training time with the complete
network.

We propose to completely remove this hyper-parameter α by altering the
training process to directly learn the accurate noise function. In the beginning,
we ignore the dynamic noise function g and train UMarineNet to create accurate
point predictions f by minimizing the MSE loss. Thereafter, the network is frozen,
i.e. the optimizer is not allowed to change its weights anymore. Only the
linear layer of the noise function is not frozen. This layer then minimizes the
following loss function:


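$$
\begin{aligned}
L_{\text{noise}} := {} & 2 \cdot \max\left( \Delta\left(\tfrac{51}{100}\right) \cdot \left( \tfrac{51}{100} - \mathrm{acc}\left(\tfrac{51}{100}\right) \right),\, 0 \right) \\
& + \sum_{j=52}^{98} \left| \Delta\left(\tfrac{j}{100}\right) \cdot \left( \tfrac{j}{100} - \mathrm{acc}\left(\tfrac{j}{100}\right) \right) \right| \\
& + 2 \cdot \max\left( \Delta\left(\tfrac{99}{100}\right) \cdot \left( \tfrac{99}{100} - \mathrm{acc}\left(\tfrac{99}{100}\right) \right),\, 0 \right)
\end{aligned} \tag{3}
$$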

with actual accuracy acc(j) and Δ(j) being the difference between the absolute
prediction error and the uncertainty interval at percent accuracy j:

$$
\Delta(j) := \frac{1}{n} \sum_{t=1}^{n} \left( \left| y_t - E[y_t] \right| - \sqrt{\mathrm{Var}[y_t]} \cdot \sqrt{2} \cdot \mathrm{erf}^{-1}(j) \right), \tag{4}
$$

with predictive mean E[y], predictive variance Var[y], batch size n, and inverse
Gauss error function erf^{-1}. The actual accuracy at the desired accuracy j ∈
(0, 1) over n samples is calculated by:


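$$
\mathrm{acc}(j) := \frac{1}{n} \sum_{t=1}^{n} \left( \left| y_t - E[y_t] \right| < \sqrt{\mathrm{Var}[y_t]} \cdot \sqrt{2} \cdot \mathrm{erf}^{-1}(j) \right), \tag{5}
$$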


with the logical operator < returning 0 for false and 1 for true. This loss function
requires multiple MC forward passes through the network during one iteration
to acquire the predictive mean E[y] and variance Var[y], but due to the frozen
layers, only the gradient for the noise layer has to be computed. We update
the weights to optimize for the noise g in only one of the MC runs. This avoids too
much change to the weights in one iteration and saves computing resources
by calculating the gradient only once.

By utilizing multiple Δ-function calls, we train the noise function g to
converge towards the desired accuracy levels between 51% and 99%. The scaling
of these Δ calls by the difference between the desired and the actual accuracy
helps the convergence of the noise model. Ideally, one would only optimize for
this difference, but the logical operator < is not differentiable.
Consequently, this term only acts as a fixed value.

We define the first and third row of Eq. 3 as outer bounds. They only increase 
their loss value if they fall below or exceed their desired accuracy of either 51% 
or 99%. Since these bounds are critical for our uncertainty prediction, they are 
doubled. Further, the second row of Eq. 3 can be seen as support points for the
actual accuracy to reach the desired accuracy. 
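
The forward value of this loss can be sketched in plain Python as follows. This is only our reading of the description: the outer bounds at 51% and 99% are clipped at zero and doubled, and the support points are assumed to run from 52% to 98% (the upper limit is an assumption). The sketch uses the identity √2 · erf⁻¹(j) = Φ⁻¹((1 + j) / 2), so the standard library's `statistics.NormalDist` can stand in for the inverse error function. All function names are hypothetical, and gradients are not computed here.

```python
import math
from statistics import NormalDist

def interval_halfwidth(var, j):
    """sqrt(Var) * sqrt(2) * erfinv(j), computed via the normal quantile function."""
    return math.sqrt(var) * NormalDist().inv_cdf((1.0 + j) / 2.0)

def acc(y, mean, var, j):
    """Fraction of samples whose absolute error lies inside the j-interval."""
    hits = [abs(yt - mt) < interval_halfwidth(vt, j)
            for yt, mt, vt in zip(y, mean, var)]
    return sum(hits) / len(hits)

def delta(y, mean, var, j):
    """Mean gap between absolute error and interval half-width."""
    gaps = [abs(yt - mt) - interval_halfwidth(vt, j)
            for yt, mt, vt in zip(y, mean, var)]
    return sum(gaps) / len(gaps)

def noise_loss(y, mean, var):
    """Forward value of the three-row loss (outer bounds plus support points)."""
    def term(percent):
        j = percent / 100.0
        # acc(...) acts as a fixed scaling factor; only delta would carry gradients.
        return delta(y, mean, var, j) * (j - acc(y, mean, var, j))
    outer = 2.0 * max(term(51), 0.0) + 2.0 * max(term(99), 0.0)
    support = sum(abs(term(p)) for p in range(52, 99))
    return outer + support
```

In a real training loop, `mean` and `var` would come from MC forward passes of the frozen network, and an automatic differentiation framework would propagate the gradient through `delta` into the noise layer only.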

The separation of learning prediction and noise can also be seen as a network 
for noise on top of a prediction network, enabling already trained networks to 
acquire reliable noise observations afterwards. Also, more complex layer struc- 
tures could be employed if the noise seems to be a non-linear process. 


4 Experimental Evaluation 


The following experiments verify that the changes to the architecture can give
insight into the importance of individual parts of the input and that the direct
learning of the noise function is at least as good as tuning the trade-off
parameter α beforehand. We compare the results of the original MarineNet and the
UMarineNet with and without direct training of the observation noise.


4.1 Combined TSS and BEFmate Dataset 


The dataset covers the time from 2014-09-18 15:00 to 2015-03-31 22:40:00
in a 10-minute resolution, which amounts to 49867 time steps of 57 different
sensors by the TSS and BEFmate project [14,15]. Since the target sensor mostly
measured at high tide, when the sensor is in the water, only 11633 target sensor
time steps are available for the same time frame. We employ a 60-40%
training/testing split. For training, 6979 steps of the target sensors and 24922 steps
of the surrounding sensors are available. To utilize the surplus time steps from
the surrounding sensors, we append up to 24 h (144 steps) of data to each target
input step. We optimize the hyper-parameters by dividing the training set into
a 70/30% split for training/validation. The complete training set is used after
the hyper-parameter optimization. Table 1 shows the hyper-parameter settings
for exPAA's original steps δ, reduced parts δ′, and exponent e as well as qDrop's
exponent ε value of UMarineNet. The remaining 40%, 4654 target sensor steps
and 19946 surrounding sensor steps of the dataset, represent the test set. Just
as for the original MarineNet, we create a model for each of the five target sensors,
which are: Speed, Temp, Conductivity, Pressure, and Direction.
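
The step counts above can be checked with a few lines of arithmetic. The sketch assumes that the 60% training share is rounded down to a whole number of steps; the variable names are illustrative.

```python
STEP_MINUTES = 10
steps_per_day = 24 * 60 // STEP_MINUTES   # 24 h of context at 10-minute resolution

target_steps = 11633
train_steps = int(0.6 * target_steps)     # 60% share, rounded down (assumption)
test_steps = target_steps - train_steps   # remaining 40%
```

Under this assumption the numbers agree with the 144 context steps and the 6979 and 4654 target sensor steps reported in the text.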


Table 1. Choice of optimized hyper-parameter settings for the UMarineNet. 


Sensor       | #steps δ | #parts δ′ | Exponent e | Quality exp. ε
Speed        | 72       | 4         | 2.0        | 0.25000
Temp         | 36       | 4         | 2.0        | 0.06250
Conductivity | 18       | 8         | 2.0        | 0.03125
Pressure     | 36       | 8         | 1.5        | 0.25000
Direction    | 72       | 8         | 1.5        | 0.25000


4.2 Methodology 


We measure the performance with three metrics. First, we employ the root mean 
squared error (RMSE) for the point prediction performance: 


$$
\mathrm{RMSE} := \sqrt{ \frac{1}{n} \sum_{t=1}^{n} \left( y_t - E[y_t] \right)^2 }, \tag{6}
$$


with predictive mean E[y_t], true target value y_t, and number of samples n.
Second, we calculate the mean standard uncertainty interval (SUI):


$$
\mathrm{SUI} := \frac{1}{n} \sum_{t=1}^{n} \sqrt{\mathrm{Var}[y_t]}, \tag{7}
$$


with the predictive variance Var[y_t] from Eq. 1. Since the SUI has no meaning
w.r.t. the actual achieved accuracy of the uncertainty prediction, we use the
Brier score:

$$
\text{Brier score} := \frac{1}{|i|} \sum_{j \in i} \left( \mathrm{acc}(j) - j \right)^2, \tag{8}
$$

with the actual mean accuracy acc(j) from Eq. 5, which defines the
percentage of values that should fall within the Gauss distribution of errors (actual
against desired accuracy). The examined desired accuracy percentages are:
i = (.55, .6, .65, .7, .75, .8, .85, .9, .95, .999). Only accuracies over 50% are
relevant to us. A 100% accuracy of predictive uncertainty would be meaningless
since a large or infinite interval always includes the true target. All metrics
indicate a better performance when they are lower, although a smaller SUI combined with a
greater Brier score would indicate an SUI that is too small to fit the desired
uncertainty accuracy.
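
The three metrics can be sketched in a few lines of plain Python, assuming the predictive means, variances, and per-level actual accuracies have already been computed. The function names are illustrative, and the Brier score follows our reading of Eq. 8 as the mean squared gap between actual and desired accuracy.

```python
import math

def rmse(y, mean):
    """Root mean squared error between targets and predictive means."""
    return math.sqrt(sum((yt - mt) ** 2 for yt, mt in zip(y, mean)) / len(y))

def sui(var):
    """Mean standard uncertainty interval: average predictive standard deviation."""
    return sum(math.sqrt(v) for v in var) / len(var)

def brier_score(actual_acc, desired_acc):
    """Mean squared difference between actual and desired accuracy levels."""
    pairs = list(zip(actual_acc, desired_acc))
    return sum((a - d) ** 2 for a, d in pairs) / len(pairs)
```

A model whose intervals are too narrow would show a small `sui` together with a large `brier_score`, which is the case the text warns about.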

We employ the one-sided Mann-Whitney U statistical test. If it returns a 
p-value below 0.05 and the U value is below or equal to the critical value, we call
the difference significant. The critical U value is 800 for 40 runs of MarineNet 
and UMarineNet. 
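
For reference, the U statistic itself can be computed by direct pair counting, as in the sketch below. This is the textbook definition, not the authors' code; the p-value computation is omitted, and the function name is an assumption.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: number of pairs (a, b) with a > b,
    counting ties as one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u
```

With 40 runs per model there are 40 · 40 = 1600 pairs, so the statistic ranges from 0 to 1600. A small value for the first sample means that its metric values tend to be lower than those of the second sample.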

Fig. 2. Comparing performance of MarineNet and UMarineNet without (w/o) and
with direct observation noise training through box plots. The columns show the target 
sensors and the rows different quality measurements. A red box stands for a significantly 
lower value in comparison to the MarineNet. (Color figure online) 


4.3 Results 


Figure 2 shows box plots that compare the best runs of MarineNet to
UMarineNet with and without direct observation noise training. Rows depict
the performance metrics and columns the target sensors. The RMSE improves
for all target sensors, except Speed for UMarineNet with direct noise training.
The architecture change alone only improved results for Conductivity, Pressure,
and Direction. UMarineNet with direct noise training performs significantly
better regarding the Brier score for Speed, Temp, Conductivity, and Pressure, but
there is no discernible difference for Direction. Without direct noise training,
UMarineNet improves the Brier score for Conductivity, Pressure, and Direction,
but is worse for Speed and Temp. A notable SUI change is seen for Pressure,
where the interval is smaller, although the uncertainty accuracy is better. In
summary, the performance of the target sensor models improved in most cases
with UMarineNet, especially when direct noise training is applied.

the proposed system than in the conventional system. Moreover, it can be seen
that the proposed system has higher accuracy than the system using the general 
convolutional neural network. 


Table 2. Search accuracy (for evaluation data) 


System          | Recall   | Precision | F-value  | MAP
Proposed system | 0.585222 | 0.545125  | 0.564462 | 0.657010
CNN             | 0.551333 | 0.329679  | 0.412623 | -
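
As a quick consistency check, the F-values in Table 2 can be reproduced as the harmonic mean of precision and recall. This assumes the standard F1 definition, which the excerpt does not state explicitly.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2.0 * precision * recall / (precision + recall)

proposed = f_value(0.545125, 0.585222)
cnn = f_value(0.329679, 0.551333)
```

Both results agree with the tabulated F-values to the reported precision.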

Fig. 8. Transition of error function


3.3 Transition of Classification Accuracy and Error Function 


Here, we examine how the classification accuracy and error function for learning
data and evaluation data change during the learning process of the proposed system
and the system using a general convolutional neural network.

The transition of classification accuracy for learning data and evaluation data
in each system is shown in Fig. 7. In Fig. 7, it can be seen that in the proposed
system, the classification accuracy for the evaluation data varies in almost the same
way as the classification accuracy for the learning data. On the other hand, in the
system using the general convolutional neural network, the classification
accuracy for the learning data increases as learning progresses, but the classification
accuracy for the evaluation data becomes almost flat after 28 epochs. From this
result, it can be seen that there is no generalization ability in the network after
learning in the system using the general convolutional neural network.

The transition of the error function for learning data and evaluation data in each
system is shown in Fig. 8. In Fig. 8, it can be seen that in the proposed system,
the error function for the evaluation data varies in almost the same way as the
error function for the learning data. On the other hand, in the system using the
general convolutional neural network, the error for the learning data decreases
as learning progresses but the error for the evaluation data increases gradually
after 29 epochs.

From these results, it can be seen that in a system using a general
convolutional neural network, learning is performed so that it can classify the learning
data correctly. However, in this system, it is considered that features common
to images to be classified in the same group cannot be extracted. In the
convolutional neural network, learning is performed paying attention to the shape
information included in the image. However, in classification considering touch
similarity, images with similar shape information are not necessarily treated as
the same group. Therefore, it can be considered that unlearned data could not be
classified correctly. On the other hand, the proposed system uses not only
saturation and value but also the histogram of saturation and value as input. In
conventional convolutional neural networks, it is rare to use features extracted
in advance as inputs. However, the features common to the group are learned
by using the histogram of saturation and value; as a result, the proposed system
can realize search with high accuracy.
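
A histogram of saturation and value can be computed from RGB pixels with the standard library alone, as sketched below. The bin count of 16, the normalisation, and the function name `sv_histograms` are assumptions made for illustration; the excerpt does not specify how the authors built their histograms.

```python
import colorsys

def sv_histograms(rgb_pixels, bins=16):
    """Normalised saturation and value histograms of an RGB image.

    rgb_pixels: non-empty iterable of (r, g, b) tuples with components in 0..255.
    bins: number of histogram bins (an assumption, not taken from the paper).
    """
    sat_hist = [0] * bins
    val_hist = [0] * bins
    count = 0
    for r, g, b in rgb_pixels:
        _, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        # Clamp so that s == 1.0 or v == 1.0 falls into the last bin.
        sat_hist[min(int(s * bins), bins - 1)] += 1
        val_hist[min(int(v * bins), bins - 1)] += 1
        count += 1
    return [c / count for c in sat_hist], [c / count for c in val_hist]
```

The two returned lists could then be supplied to the network alongside the per-pixel saturation and value channels.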


4 Conclusions 


In this paper, we have proposed artwork retrieval based on similarity of touch
using a convolutional neural network. In the proposed system, a convolutional
neural network is trained so that images can be classified into a group based on
touch, with saturation and value and the histogram of saturation and value as
input data, and the trained network is used to realize the retrieval. We carried
out a series of computer experiments and confirmed that the proposed system
can realize artwork retrieval based on similarity of touch with higher accuracy
than the conventional system.

Abstract. Depth perception through stereo vision is an important
feature of biological and artificial vision systems. While biological systems
can compute disparities effortlessly, it requires intensive processing for
artificial vision systems. The computing complexity resides in solving
the correspondence problem: finding matching pairs of points in the
two eyes. Inspired by the retina, event-based vision sensors allow a new
constraint to solve the correspondence problem: time. Relying on
precise spike-time, spiking neural networks can take advantage of this
constraint. However, disparities can only be computed from dynamic
environments since event-based vision sensors only report local changes in
light intensity. In this paper, we show how microsaccadic eye movements
can be used to compute disparities from static environments. To this
end, we built a robotic head supporting two Dynamic Vision Sensors
(DVS) capable of independent panning and simultaneous tilting. We
evaluate the method on both static and dynamic scenes perceived through
microsaccades. This paper demonstrates the complementarity of
event-based vision sensors and active perception leading to more biologically
inspired robots.


Keywords: Spiking neural networks · Event-based stereo vision · Eye movements


1 Introduction 


Depth perception is an essential feature of biological and artificial vision systems.
Stereopsis (or stereo vision) refers to the process of extracting depth information
from both eyes. The human eyes are shifted laterally, which is why each eye forms
a slightly different image of the world. The brain is capable of matching a
point in one image with its corresponding point in the other image, measuring
its relative distance on the retina and using this value to estimate the distance
of the object to the viewer. The relative difference of the projections of the same
object on the two retinas is called disparity.

While stereo vision is realised unconsciously and effortlessly in biology, it
requires intensive processing for artificial vision systems. The core problem of
stereo vision systems is the well-known correspondence problem: finding matches
between visual information perceived by the two sensors. A matched pair of pixels
enables the precise calculation of the depth using the geometry of the camera
setup and the disparity of the pixels on the epipolar line [1]. As the complexity
of the scenery increases and noise is added to the images, the computational
expense of common machine vision systems increases significantly, affecting the
speed, size, and efficiency of the used hardware [19].
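
For the textbook special case of two rectified, parallel pinhole cameras, the depth follows from the disparity as Z = f · B / d, where f is the focal length in pixels, B the baseline, and d the disparity in pixels. The sketch below illustrates only this special case; the function name and numbers are illustrative, and a setup with independently panning sensors would need the full camera geometry instead.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth along the optical axis for a rectified, parallel stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px

# Example: f = 200 px, B = 0.1 m, d = 10 px (made-up values).
z = depth_from_disparity(200.0, 0.1, 10.0)
```

Smaller disparities correspond to larger depths, which is why distant objects are harder to resolve at a fixed pixel resolution.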

Advances in neuromorphic engineering enable new approaches for stereo
vision systems. The use of a Dynamic Vision Sensor (DVS, or silicon retina) [11]
adds another constraint to the already existing spatial constraints for
matching: time. Unlike conventional cameras, which operate with frame-based images,
a DVS emits independent pixel events at precise times on local light intensity
changes. This leads to a continuous stream of events, well suited for processing
with spiking neural networks. Spiking neural networks are referred to as the
third generation of artificial networks [12]. Unlike in their non-spiking counterparts,
neurons are defined as dynamical systems in continuous time and not on a
discrete time basis. Communication in spiking neural networks is asynchronous
and is based on instantaneous spikes. While the form of the spike does not hold
any specific information, it is the number and timing of spikes that matter [7].
Even though it is possible to simulate spiking networks on conventional
computers, their real potential with respect to speed and efficiency is unveiled when
processed on neuromorphic hardware [5,19].

Recently, approaches have been proposed for disparity computation on event
streams with spiking neural networks [3,19], both based on groundwork in [13].
These approaches are discussed in Sect. 2. They consist of a three-dimensional
spiking network where output neurons describe one unique point in the observed
3D-space (see Fig. 1). In other words, an output neuron emits a spike when its
location in 3D-space becomes occupied or unoccupied. In this paper, we show
how the method can be used to perceive depth from motionless static scenes
through microsaccadic eye movements. To this end, we built a robotic head for
the humanoid robot HoLLiE [9] supporting two DVS capable of independent
panning and simultaneous tilting, see Fig. 1a. Our results suggest that synchronous
microsaccadic eye movements in both eyes could be used in biology for
stereopsis. While the role of fixational eye movements is not fully understood, their
importance in perception was already suggested in [10,16,21]. Additionally, our
network is implemented in PyNN [2] and can run either on SpiNNaker [6] or on a
classical CPU with the NEST simulator [8].


2 Related Work 


In this section, we present Poggio and Marr's cooperative algorithm for stereo
matching, which was published in 1982 [13] and forms the foundation for
further work in the field. The method has recently been improved in [3] with the
introduction of small computational units, so-called micro-ensembles.
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According to [14], three steps are involved in measuring stereo disparity. In 
the first step (S1) a point of interest is selected from a surface in one image. In 
step two (S2) the same point has to be identified in the other image. Step three 
(S3) measures the disparity between the two corresponding image points, which 
can be used to calculate the distance of the object to the viewer. However, false 
targets make it difficult to find a matching pair of points. Physical properties 
of rigid bodies are used to minimize the number of false matches. One of these 
properties is that a point on the surface of an object has a unique position at 
a given point in time (Pl). The second physical restraint that can be used is 
the fact that surfaces of objects are perceived as smooth from the perspective of 
the observer. Small changes in topology such as roughness or protrusions are of 
minor importance for the estimation of distance (P2) [13]. To minimize the
possibility of a mismatch, the physical constraints P1 and P2 can be rewritten into
matching constraints (C). These matching constraints implement rules of
communication between disparity-sensitive neurons (see Fig. 1b). Derived from P1,
the uniqueness constraint (C1) states that for every given point seen by one area
of one eye, at a specific time, there can be at most one corresponding match in
the other. Therefore C1 inhibits communication in horizontal and vertical
directions between the disparity-sensitive neurons. The physical constraint P2 results
in the continuity constraint (C2), which is based on the assumption that physical
matter is cohesive and generally has a smooth surface. It encourages
communication along the diagonal lines of constant disparity. The compatibility constraint
(C3) states that “black dots can only match black dots” [14]. Recently, this
method was improved in [3] with the addition of micro-ensembles. A
micro-ensemble consists of two blocker neurons and one collector neuron
(see Fig. 1c). Micro-ensembles prevent high-frequency stimulation of a single
DVS pixel from triggering false matches by exceeding the collector’s threshold value
when the corresponding pixel in the other DVS is not active. The micro-ensemble
therefore emulates an AND gate, ensuring that only signals received
by both sensors can trigger a match [3].
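
The uniqueness and continuity constraints can be sketched as connection rules within one epipolar layer. This is a minimal illustration, assuming that collectors are indexed by their left and right pixel columns and that disparity is defined as `xl - xr`; the function name, the coordinate convention and the restriction of excitation to direct neighbours are assumptions made here, not the wiring used in [3].

```python
def constraint_targets(xl, xr, width, d_min=0, d_max=None):
    """Return (inhibitory, excitatory) targets of the collector at (xl, xr).

    Coordinates are (left pixel column, right pixel column) within one
    epipolar layer; the disparity of a collector is assumed to be xl - xr.
    """
    if d_max is None:
        d_max = width - 1

    def valid(a, b):
        # A collector exists only inside the sensor and the disparity bounds.
        return 0 <= a < width and 0 <= b < width and d_min <= a - b <= d_max

    # C1 (uniqueness): inhibit all other candidates that share the same left
    # pixel or the same right pixel, i.e. the two lines of sight.
    inhibitory = [(xl, b) for b in range(width) if b != xr and valid(xl, b)]
    inhibitory += [(a, xr) for a in range(width) if a != xl and valid(a, xr)]

    # C2 (continuity): excite the direct neighbours on the diagonal of
    # constant disparity.
    excitatory = [(xl + s, xr + s) for s in (-1, 1) if valid(xl + s, xr + s)]
    return inhibitory, excitatory


inh, exc = constraint_targets(3, 1, width=5)
print(inh)  # candidates competing for left pixel 3 or right pixel 1
print(exc)  # neighbours with the same disparity (3 - 1 = 2)
```

With positive disparities only, the collector at (3, 1) inhibits six competitors and excites its two diagonal neighbours.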

The structure of our spiking network computing disparity from stereo event- 
based vision sensors is based on [3,19]. The network consists of a three dimen- 
sional grid of disparity-sensitive neurons (see Fig. la). Each of these disparity- 
sensitive neurons describe one unique point in the observed 3D-space, relative 
to the common fixation point of the cameras [20]. For each disparity neuron, 
a micro-ensemble ensures hetero-lateral matching. If the timing of the events 
projected by the retinal pixels into the neural ensemble is temporally congruent, 
the signal reaches the disparity-sensitive neuron. However, if the temporal offset 
of the incoming signals between the left and right pixels is too large, the blockers 
prevent the activation of the disparity-sensitive neuron. The C3 constraint
(compatibility constraint) could be implemented by separating ON and OFF events
into two separate pathways. As this would double the number of neurons, the C3
constraint is often ignored so that ON and OFF events can match each other.
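
The AND-gate behaviour of a micro-ensemble can be illustrated without simulating spiking dynamics. The sketch below replaces the blocker and collector neurons by a plain coincidence test on event timestamps; the function name and the 2 ms window are hypothetical and only stand in for whatever temporal tolerance the neuron parameters produce.

```python
def collector_spikes(left_times, right_times, window=0.002):
    """Times (in seconds) at which the collector would fire.

    A left and a right event are matched when they lie at most `window`
    seconds apart; unmatched events are dropped, which mimics the blockers.
    """
    left = sorted(left_times)
    right = sorted(right_times)
    spikes = []
    i = j = 0
    while i < len(left) and j < len(right):
        dt = left[i] - right[j]
        if abs(dt) <= window:
            # Temporally congruent pair: the collector fires once the
            # later of the two events has arrived.
            spikes.append(max(left[i], right[j]))
            i += 1
            j += 1
        elif dt < 0:
            i += 1  # left event has no partner in time: blocked
        else:
            j += 1  # right event has no partner in time: blocked
    return spikes


# A burst on the left pixel alone never triggers a match.
print(collector_spikes([0.001 * k for k in range(10)], []))
# Only the last left event has a partner on the right.
print(collector_spikes([0.010, 0.0105, 0.011, 0.050], [0.0505]))
```

The first call returns an empty list regardless of how many events the single pixel emits, which is the failure mode the micro-ensemble is meant to prevent.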

Fig. 1. Structure of the stereo network for detecting all possible positive disparities
(schematics inspired by [3]). Triangular red edges denote excitatory synapses, rounded
green edges denote inhibitory synapses. (a): Three-dimensional structure of the stereo 
network. Address events from the two DVS belonging to the same epipolar plane are 
fed to the corresponding epipolar layer in the network. (b): Organization of micro- 
ensembles within an epipolar layer. Each pixel connects to micro-ensembles defining 
a line of sight. The micro-ensembles are connected to each other with respect to the 
constraints mentioned in Sect. 2. For clarity, only the outgoing connections of a single 
micro-ensemble are drawn. The number of micro-ensembles can be reduced by bound- 
ing the minimum and maximum detectable disparities. (c): Schematic representation 
of a neural micro-ensemble. The two blue neurons on the left and bottom of the micro- 
ensemble are the blockers, while the red neuron in the middle is the disparity-sensitive 
collector neuron. Micro-ensembles are connected to each other by their collector 
neurons. (Color figure online) 


3 Evaluation 


In this paper, we rely on the spiking network structure presented in [3], see 
Sect. 2. The contribution of this paper is to enable the method to extract dis- 
parities from static scenes through microsaccadic eye movements by mounting 
the sensors on a robotic head. In this section, we evaluate our approach on
real-world scenes with the built robotic head. Experiments are conducted both on
static scenes perceived with microsaccadic eye movements and on dynamic scenes.
The scenes are recorded with the two DVS with ROS using the driver from [17].


3.1 Micro-saccades on the Robotics Head 


Pan-tilt units have already been used to convert image datasets to event-based
datasets through microsaccadic eye movements [18]. In this paper, we present our
robotic head platform for the humanoid robot HoLLiE [9], which reproduces stereo
eye movements. The head has three degrees of freedom: tilting both eyes
simultaneously with a Dynamixel MX-64 servo, and panning the two eyes
independently with two Dynamixel MX-28 servos. The rotations are centered around the
focal point of the two DVS. The robotic head has a total width of 253 mm
and an interpupillary distance (IPD) of 188 mm. The average IPD of a human
is around 63 mm [4]. Microsaccades are performed by a slight panning or
tilting motion of both DVS at very high speed. The robotic head is depicted in
Fig. 1a.


248 J. Kaiser et al.


Fig. 2. Experiment setup for disparity computation on a static scene perceived with
microsaccades. (a): Overview of the setup. The DVS head is laid on a table outdoors
with two objects (a ball and a thermos flask) and both DVS look parallel towards them.
(b): Accumulated events in the right DVS after a horizontal microsaccade (panning).
Vertical edges have a high response. (c): Accumulated events in the right DVS after a
vertical microsaccade (tilting). Horizontal edges have a high response.

Fig. 3. Output of the stereo network for the static scene experiment perceived through
microsaccades. (a): Rendering of the computed disparities during panning. (b):
Histogram of the computed disparities during panning. Note the peaks around disparity 7
corresponding to the vertical garage door, around 20 for the thermos flask and around
29 for the ball (see Fig. 2a). (c): Rendering of the computed disparities during tilting.
(d): Histogram of the computed disparities during tilting. Fewer events were
generated than with horizontal microsaccades because the scene is dominated by vertical
edges, leading to fewer disparity detections (see Fig. 2c).

Fig. 4. Experiment data for disparity computation on a dynamic scene perceived with
microsaccades. The same scene as in the first experiment (Fig. 2a) is used, with the
addition of a person walking in the back. (a): Accumulated events in the right DVS
for the whole duration of the experiment (1.4 s). Only a horizontal microsaccade is
performed, while a person walks in the back. (b): Corresponding event histograms
for the left and right DVS. The two peaks denote the positive and negative panning
(go and return to position). The constant activity reflects the person walking in the
back.


3.2 Static Scenes Perceived Through Microsaccades 


In this experiment, the robotic head is laid on a table outdoors and observes
two objects (a ball and a thermos flask) at different depths, see Fig. 2a. The
head performs a horizontal microsaccade followed by a vertical microsaccade
of around 2.8°.

The network manages to compute the disparity of the different objects in the
scene with a horizontal microsaccade (Fig. 3b), including the garage door in
the background. Because most contrast lines in the scene are vertical, the tilting 
microsaccade does not trigger many events, leading to few disparity detections 
(Fig. 3d). Additionally, extracting disparity of horizontal edges is harder for the 
network, because many events will share the same epipolar layer (see Fig. 1b). 
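
The disparity peaks reported for this scene (around 7, 20 and 29) can be related to distance with the standard relation Z = f · B / d for two parallel pinhole cameras. The sketch below assumes that this idealised model applies to the head; the baseline is the 188 mm IPD stated in Sect. 3.1, whereas `FOCAL_PX` is a hypothetical placeholder because the focal length is not given. The depth ratios between objects do not depend on that placeholder.

```python
BASELINE_M = 0.188  # interpupillary distance of the robotic head (188 mm)
FOCAL_PX = 100.0    # hypothetical focal length in pixels, not from the paper


def depth_from_disparity(disparity_px, focal_px=FOCAL_PX, baseline_m=BASELINE_M):
    """Depth in metres for parallel pinhole cameras: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Approximate histogram peaks read from the caption of Fig. 3.
PEAKS = {"garage door": 7, "thermos flask": 20, "ball": 29}

depths = {name: depth_from_disparity(d) for name, d in PEAKS.items()}
relative = {name: z / depths["ball"] for name, z in depths.items()}

for name in PEAKS:
    print(f"{name}: {relative[name]:.2f} x the distance of the ball")
```

Under these assumptions the garage door lies roughly four times as far away as the ball, which matches the ordering of the objects in the scene.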


3.3 Dynamic Scenes Perceived Through Microsaccades 


In this experiment, we evaluate whether the method can extract disparities of 
dynamic objects and static objects simultaneously with microsaccades. We rely 
on the same setup as for the previous experiment (Fig. 2a), with an additional 
person walking in the back of the scene. The generated address events are visu- 
alized in Fig. 4. 

Fig. 5. Output of the stereo network for the dynamic scene experiment perceived
through microsaccades. (a): Rendering of the computed disparities at t = 0.1 s
during panning while a person walks in the back. (b): Histogram showing the number of
detected disparities with respect to time. As expected, the number of detections
correlates with the number of address events (see Fig. 4b). (c): Rendering of the computed
disparities at t = 0.3 s when no microsaccades are performed. (d): Histogram of the
computed disparities for the whole sequence. The events corresponding to the walking
person have a disparity around 10, between the garage door and the thermos flask.
Compared to the purely static scene, fewer events were generated for the garage door as
it is occluded by the human, see Fig. 3b.


As can be seen in Fig. 5, the network manages to compute the disparity of 
the different objects in the scene as well as of the walking person. 


4 Conclusion 


Depth perception through stereo vision is an important feature for many
biological and artificial systems. While biological systems can compute disparities
effortlessly, artificial vision systems require intensive processing to do so. Recently,
spiking network models were introduced in [3,19], both based on groundwork 
in [13]. Relying on event-based vision sensors, these models take advantage of a 
new constraint to solve the correspondence problem: time. 

Since event-based vision sensors such as the DVS only report changes in light 
intensity, these methods could only extract disparities from dynamic scenes. In 
this paper, we show how synchronous microsaccadic eye movements enable such
networks to extract disparities from static scenes. To this end, a robotic head
platform for the humanoid robot HoLLiE [9] capable of simultaneous tilting
and independent panning was built. As the retina also adapts rapidly to
non-changing stimuli [15,21], it is likely that biology also relies on fixational eye
movements to perceive depth in static scenes. 

For future work, the robotic head could implement other types of eye
movements such as saccades and smooth pursuit. Additionally, one could greatly
reduce the number of required neurons with hard bounds on the minimum and
maximum detectable disparities. In this setup, active vision could be used to
squint the eyes to the relevant baseline depth.
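
The saving from bounding the detectable disparities can be estimated by counting micro-ensembles. This is a rough sketch, assuming one micro-ensemble (two blockers and one collector, i.e. three neurons) per candidate pair of left and right pixels, positive disparities only, and a hypothetical sensor resolution of 128 × 128 pixels, which is not stated in the text.

```python
def count_neurons(width, height, d_min, d_max, neurons_per_ensemble=3):
    """Number of neurons needed for disparities d_min..d_max (inclusive).

    In each epipolar layer, a disparity d can be formed by `width - d`
    left/right pixel pairs, and each pair needs one micro-ensemble.
    """
    ensembles_per_layer = sum(
        width - d for d in range(d_min, d_max + 1) if 0 <= d < width
    )
    return height * ensembles_per_layer * neurons_per_ensemble


full = count_neurons(128, 128, 0, 127)    # every positive disparity
bounded = count_neurons(128, 128, 5, 35)  # hypothetical working range
print(full, bounded, f"{bounded / full:.0%} of the full network")
```

For this example, restricting the network to disparities between 5 and 35 pixels keeps about 41% of the neurons of the unbounded network.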


Abstract. In real-world scenarios, objects are often partially occluded.
Object recognition must therefore be robust against such perturbations.
Convolutional networks have shown good performance in classification
tasks. The learned convolutional filters seem similar to receptive
fields of simple cells found in the primary visual cortex. Alternatively,
spiking neural networks are more biologically plausible. We developed a
two-layer spiking network, trained on natural scenes with a biologically
plausible learning rule. It is compared to two deep convolutional neural
networks using a classification task of stepwise pixel erasement on
MNIST. In comparison to these networks, the spiking approach achieves
good accuracy and robustness.


1 Introduction


Deep convolutional neural networks (DCNNs) have shown outstanding
performance on different object recognition tasks [10,11,19], such as handwritten digits
(MNIST [5]) or the ImageNet challenge [18]. Previous studies show that
filters of a DCNN trained on images are similar to receptive fields of simple
cells in the primary visual cortex of primates [21,24]. DCNNs have thus been
suggested, to a certain degree, as a model of human vision, although the
back-propagation algorithm does not seem to be biologically plausible [14,16].
Alternatively, many models have been published in the field of computational
neuroscience whose unsupervised learning is based on the occurrence of pre- and
postsynaptic spikes. A previous work [16] presented a model using a
spike-timing-dependent plasticity (STDP) rule to recognize digits of the MNIST dataset. We
propose an STDP network with biologically motivated STDP learning rules for
excitatory and inhibitory synapses to better mirror the structure of the visual
cortex. We use a voltage-based learning rule from Clopath et al. [7] for
excitatory synapses and a symmetric inhibitory learning rule from Vogels et al. [9]
for inhibitory synapses. During learning, we present natural scenes to our
network [4]. The thereby emerging receptive fields [7] are similar to those of simple
cells in the primary visual cortex [1,2]. After learning, we present digits of the 
MNIST data set to the network and measure the activity of the excitatory popu- 
lation. The activity vectors on the training set are used to train a support vector 
machine (SVM) with a linear kernel to be used on the test set to estimate the 
accuracy of our neural network. 

We previously evaluated the robustness of classification by a gradual erasement
of pixels in the MNIST data set. Our evaluation showed that inhibition can
improve robustness by reducing redundant activities in the network [17]. To 
evaluate our spiking network, we apply this task by placing white pixels in 5% 
steps in all images of the MNIST test set and by measuring accuracy on these 
degraded digits. We compare our spiking network with two DCNNs. The first 
DCNN is the well-known LeNet 5 network from LeCun et al. [5]. The second one
is based on the VGG16 network from Simonyan and Zisserman [19]. Both deep 
networks are trained on the MNIST data set and accuracy is measured on the 
test set with different levels of pixel erasement. 

Here we follow the idea that a biologically motivated model trained by
Hebbian learning on natural scenes should discover a codebook of features that can
be used for a large set of classification tasks. Thus, we train our spiking model
on small segments of natural scenes. As these image patches contain different
spatial orientations and frequencies, we obtain receptive fields which are selective
for simple features. With this generalized coding, we achieved a recognition
accuracy of 98.08%. Further, our spiking network shows good robustness against
pixel erasement, even with only one layer of excitatory and inhibitory neurons.


2 Methods 


Both deep convolutional networks are implemented in Keras v.2.0.6 [15] with
TensorFlow v.1.2.1 and Python 3.6. Our spiking network is implemented in Python
2.7 with the neuronal simulator ANNarchy (v.4.6) [20]. To classify the activity
vectors of our network, we used a support vector machine with a linear kernel, using
the LinearSVC class from the sklearn library v.0.19.1.


2.1 Spiking Model 


Fig. 1. A: Schematic diagram of our spiking network. The input layer consists of 648
neurons. The second layer consists of 324 excitatory and 81 inhibitory neurons. B:
Example for the image distortion from the original digit to 90% pixel erasement.


Populations. The architecture of our spiking network (Fig. 1A) is inspired by
the primary visual cortex and consists of spiking neurons in two layers. The input
size is 18 × 18 pixels. We used randomly chosen patches out of a set of whitened
natural scenes [4] to train our network. To avoid negative firing rates, positive
values of a patch are assigned to an On-part and negative values to an Off-part.
Therefore, the first layer consists of 648 neurons in an 18 × 18 × 2 grid. Every
pixel corresponds to one neuron in the layer. The neurons fire according to a
Poisson distribution, whose firing rate is determined by the corresponding pixel
values. The presented pixels are normalized with the absolute maximum value
of the original image and multiplied by a maximum firing rate of 125 Hz. Each
patch was presented for 125 ms. Learning was stopped after 400,000 patches. The
presented patch was flipped around the vertical or horizontal axis with a probability
of 50% to avoid an orientation bias [7].
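
The On/Off input encoding described above can be sketched in a few lines. This is an illustration under stated assumptions, not the ANNarchy implementation: the function names are hypothetical, the patch is a list of rows of signed values, and the Poisson process is approximated by one Bernoulli draw per 1 ms step.

```python
import random

MAX_RATE_HZ = 125.0  # maximum firing rate stated in the text


def encode_on_off(patch, image_abs_max):
    """Split a signed patch into non-negative On and Off firing rates (Hz).

    `image_abs_max` is the absolute maximum of the image the patch was
    cut from, which is used for normalisation.
    """
    if image_abs_max <= 0:
        raise ValueError("image_abs_max must be positive")
    scale = MAX_RATE_HZ / image_abs_max
    on = [[scale * max(v, 0.0) for v in row] for row in patch]
    off = [[scale * max(-v, 0.0) for v in row] for row in patch]
    return on, off


def spike_count(rate_hz, rng, duration_s=0.125, dt=0.001):
    """Approximate Poisson spike count over one 125 ms presentation."""
    p = rate_hz * dt  # spike probability per time step
    steps = round(duration_s / dt)
    return sum(rng.random() < p for _ in range(steps))


on, off = encode_on_off([[0.5, -1.0], [0.0, 0.25]], image_abs_max=2.0)
print(on)   # positive values drive the On neurons
print(off)  # negative values drive the Off neurons
print(spike_count(on[0][0], random.Random(0)))
```

For an 18 × 18 patch, the two returned grids together hold the 648 rates of the input layer.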

The neurons in the first layer are all-to-all connected to the neurons in the 
second layer. The second layer consists of a population of 324 excitatory and 81 
inhibitory neurons to achieve the 4:1 ratio between the number of excitatory and
inhibitory neurons as found in the visual cortex [3,13]. All neurons gather
information from the whole presented input. Both populations consist of adaptive
exponential integrate-and-fire neurons (AdEx) [7]. The description of the
membrane potential $u$ is presented in Eq. 1. The slope factor is $\Delta_T$, $C$ is the
membrane capacitance, $E_L$ is the resting potential and $g_L$ is the leak conductance.
The depolarizing afterpotential is described by $z$, and $w_{ad}$ is the
hyperpolarizing adaptation current. The input is denoted by $I_{exc}$ for excitatory and $I_{inh}$ for
inhibitory current. Input currents are incremented by the sum of the presynaptic
spikes of the previous time step, multiplied by the synaptic weight.


$$C \frac{du}{dt} = -g_L (u - E_L) + g_L \Delta_T \, e^{\frac{u - V_T}{\Delta_T}} - w_{ad} + z + I_{exc} - I_{inh} \qquad (1)$$


A spike is emitted when the membrane potential exceeds the adaptive spiking
threshold $V_T$. After a spike, the membrane potential is set to 29 mV for 2 ms,
and then it is reset to $E_L$.
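
The membrane equation of the AdEx neuron can be evaluated numerically as follows. This sketch assumes the usual AdEx form, C du/dt = -g_L (u - E_L) + g_L Δ_T exp((u - V_T) / Δ_T) - w_ad + z + I_exc - I_inh, with the inhibitory current entering with a negative sign. All parameter values are illustrative placeholders in typical AdEx ranges and are not taken from this paper.

```python
import math


def adex_dudt(u, w_ad=0.0, z=0.0, i_exc=0.0, i_inh=0.0, v_t=-50.4,
              c=281.0, g_l=30.0, e_l=-70.6, delta_t=2.0):
    """Time derivative of the membrane potential in mV/ms.

    Units: u, v_t, e_l, delta_t in mV; c in pF; g_l in nS; currents in pA.
    Parameter defaults are placeholders, not the values used in the paper.
    """
    leak = -g_l * (u - e_l)
    spike_initiation = g_l * delta_t * math.exp((u - v_t) / delta_t)
    return (leak + spike_initiation - w_ad + z + i_exc - i_inh) / c


def euler_step(u, dt_ms=0.1, **kwargs):
    """One explicit Euler step of the membrane potential."""
    return u + dt_ms * adex_dudt(u, **kwargs)


u = -60.0
for _ in range(5):
    u = euler_step(u)  # without input, u relaxes towards the resting potential
print(round(u, 3))
```

Threshold adaptation, the spike reset to 29 mV and the dynamics of `w_ad` and `z` are deliberately left out; they would be added around this function in a full simulation.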


Excitatory Plasticity. The plasticity of excitatory connections from the first 
to the second layer, as well as connections from the excitatory to the inhibitory 
population within the second layer, follows the voltage-based STDP rule [7]. The 
development of the weight between a presynaptic neuron $i$ and a postsynaptic
neuron $j$ depends on the presynaptic spike event $X_i$ and the presynaptic spike
trace $\bar{x}_i$ as well as on the postsynaptic membrane potential $u$ and two averages
of the membrane potential, $\bar{u}_+$ and $\bar{u}_-$. The parameters $A_{LTP}$ and $A_{LTD}$ are
the learning rates for long-term potentiation (LTP) and long-term depression
(LTD). The parameters $\theta_+$ and $\theta_-$ are thresholds, which must be exceeded by
the membrane potential or its long-time averages.
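
The quantities listed above combine into a weight update of the following shape. This is a reconstruction from the symbol definitions and from the general form of the voltage-based rule in [7], not a quotation of the equation used by the authors; in particular, writing the homeostatic factor as the plain ratio of the long-term average $\bar{\bar{u}}$ to a reference value $u_{ref}$ is an assumption. Here $(\cdot)^+$ denotes rectification, $X_i$ the presynaptic spike event, $\bar{x}_i$ the presynaptic trace, $\bar{u}_\pm$ the two averaged membrane potentials, and $\theta_\pm$ the thresholds.

```latex
\frac{dw_{ij}}{dt} =
    A_{LTP}\, \bar{x}_i\, (u - \theta_+)^+\, (\bar{u}_+ - \theta_-)^+
  - A_{LTD}\, \frac{\bar{\bar{u}}}{u_{ref}}\, X_i\, (\bar{u}_- - \theta_-)^+
```

The first term produces LTP only while the membrane potential and its average exceed their thresholds; the second term produces LTD at each presynaptic spike, scaled by the homeostatic factor.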

The homeostatic mechanism of the learning rule is implemented by the ratio
between $\bar{\bar{u}}$ and a reference value $u_{ref}$. It adjusts the amount of emergent LTD to
control the postsynaptic firing rate. Therefore, $\bar{\bar{u}}$ implements a sliding threshold
to develop selectivity of neurons. Clopath et al. [7] propose to equalize the norm
of the OFF weights to the norm of the ON weights every 20 s. We did this
for the excitatory weights from the input layer to the excitatory and the inhibitory
population, per neuron. The weights are limited by an upper and lower bound.


Inhibitory Plasticity. The connections from the inhibitory to the excitatory 
population and the lateral connections between the inhibitory neurons develop 
with the inhibitory learning rule from Vogels et al. [9] (see Eq. 3). 


$$\Delta w_{ij} = \eta\, (\bar{x}_j - \rho), \quad \text{for pre-synaptic spike}$$
$$\Delta w_{ij} = \eta\, \bar{x}_i, \quad \text{for post-synaptic spike} \qquad (3)$$


The pre-synaptic spike trace is $\bar{x}_i$ and the spike trace for the post-synaptic
neuron is $\bar{x}_j$. When the respective neuron spikes, its spike trace increases by
one; otherwise it decays to zero with time constant $\tau_i$ or $\tau_j$. The inhibitory weight changes on
a pre- or postsynaptic spike with the learning rate $\eta$.

The constant value $\rho$ specifies the strength of inhibition to suppress the
postsynaptic activity until LTD can occur. The inhibitory weights are limited 
by a lower and upper bound. 
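
The inhibitory rule can be sketched as an event-driven update of a single synapse. The sketch assumes the form in which a pre-synaptic spike changes the weight by η(x_post − ρ) and a post-synaptic spike by η·x_pre, with exponentially decaying traces; the class name, the parameter values and the order of trace and weight updates within a time step are assumptions made for illustration.

```python
import math


class InhibitorySynapse:
    """Single inhibitory synapse with symmetric, trace-based plasticity."""

    def __init__(self, w=0.5, eta=0.1, rho=0.2, tau_pre=10.0, tau_post=10.0,
                 w_min=0.0, w_max=1.0):
        self.w, self.eta, self.rho = w, eta, rho
        self.tau_pre, self.tau_post = tau_pre, tau_post
        self.w_min, self.w_max = w_min, w_max
        self.x_pre = 0.0   # pre-synaptic spike trace
        self.x_post = 0.0  # post-synaptic spike trace

    def step(self, dt, pre_spike=False, post_spike=False):
        # Traces jump by one on a spike and decay towards zero otherwise.
        if pre_spike:
            self.x_pre += 1.0
        else:
            self.x_pre *= math.exp(-dt / self.tau_pre)
        if post_spike:
            self.x_post += 1.0
        else:
            self.x_post *= math.exp(-dt / self.tau_post)

        # Weight changes happen only on spikes.
        if pre_spike:
            self.w += self.eta * (self.x_post - self.rho)
        if post_spike:
            self.w += self.eta * self.x_pre

        # Keep the weight inside its bounds.
        self.w = min(self.w_max, max(self.w_min, self.w))
        return self.w


syn = InhibitorySynapse()
print(syn.step(1.0, pre_spike=True))   # silent post neuron: weight decreases
print(syn.step(1.0, post_spike=True))  # post spike after pre spike: increase
```

A pre-synaptic spike arriving while the post-synaptic trace is below ρ weakens the synapse, whereas correlated pre- and post-synaptic activity strengthens it.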


2.2 Deep Convolutional Networks 


To assess the performance of our network approach on MNIST recognition, we
compared it to two deep convolutional neural networks (DCNNs). The first
network is the well-known LeNet 5, introduced by LeCun et al. [5]. It is
hierarchically structured with two pairs of 2D-convolutional and max-pooling layers,
followed by two fully connected one-dimensional layers. The last layer is the
classification layer with a softmax classifier. The first convolutional layer has
a kernel size of 3 × 3 pixels and 32 feature maps. The kernel size of the second
convolutional layer is also 3 × 3, but it has 64 feature maps. For the second
max-pooling layer, dropout regularisation with a dropout ratio of 0.5 is used.
Both max-pooling layers have a 2 × 2 pooling size. The architecture of the second
model is based on the VGG16 network proposed by Simonyan and Zisserman
[19]. As a consequence of the small input size, we had to remove the last three
2D convolutional layers and the 2D max-pooling layer. Further, no dropout
regularisation was used. This shortened model is referred to as VGG13. Both networks
were trained for 50 epochs on the MNIST training set [5]. The validation accuracy
is measured on 10% of the training set. The remaining 90% is used for training.
The Adadelta optimizer [12] with $\rho = 0.95$ is used for both networks.
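
Tracing the spatial size of the feature maps shows why a 28 × 28 input does not fit the full VGG16. The sketch assumes stride-1 convolutions, non-overlapping 2 × 2 pooling with floor rounding, unpadded ("valid") convolutions for the LeNet 5 variant, and size-preserving ("same") convolutions with five pooling stages for VGG16. The padding of the LeNet 5 variant is not stated in the text and is an assumption.

```python
def conv_out(size, kernel, padding):
    """Spatial size after a stride-1 convolution."""
    if padding == "same":
        return size
    return size - kernel + 1  # "valid": no zero padding


def pool_out(size, pool=2):
    """Spatial size after non-overlapping max-pooling."""
    return size // pool


def lenet_like_sizes(size=28):
    """Sizes after conv1, pool1, conv2, pool2 (3 x 3 kernels, 2 x 2 pooling)."""
    sizes = []
    for _ in range(2):
        size = conv_out(size, 3, "valid")
        sizes.append(size)
        size = pool_out(size)
        sizes.append(size)
    return sizes


def vgg16_pool_sizes(size=28, n_blocks=5):
    """Sizes after each of the pooling stages of a VGG16-style network."""
    sizes = []
    for _ in range(n_blocks):
        size = pool_out(conv_out(size, 3, "same"))
        sizes.append(size)
    return sizes


print(lenet_like_sizes())  # [26, 13, 11, 5]
print(vgg16_pool_sizes())  # [14, 7, 3, 1, 0]
```

After four pooling stages the feature map is already 1 × 1, so a fifth stage has nothing left to pool, which is consistent with dropping the last convolutional block and its pooling layer.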


2.3 Measurement of Accuracy


The MNIST images have a resolution of 28 × 28 pixels. Because the input size
of the spiking network is 18 × 18 pixels, we divided each image of the MNIST
set into four patches of 18 × 18 pixels each. The first patch was cut out
at the upper left corner, and a horizontal and vertical pixel shift of 10 pixels was
applied to cut out the other three patches. We presented every patch for 125 ms,
without learning, and measured the number of spikes per neuron. We repeated
every patch presentation ten times to calculate a mean activity per neuron on
every patch. For every digit, the final activity vector consists of 324 × 4 = 1296
values. We fitted a support vector machine (SVM) with the merged activity
vectors of the training set. Before the fitting, we normalized the activity vectors
between zero and one. The SVM had a linear kernel, the squared hinge loss and
the L2 penalty with a C-parameter of one. To measure accuracy, we used the
merged activity vectors of the test set as input to the fitted SVM and compared the
known labels with the predictions of the SVM. Finally, we measured the accuracy
of five separately trained networks and report the average accuracy here.
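
The patch layout and the merged activity vector can be sketched as follows. The helper names are hypothetical and the network response is replaced by dummy values; only the geometry (18 × 18 patches, shift of 10 pixels, 324 × 4 = 1296 values) is taken from the text.

```python
PATCH = 18  # input size of the spiking network
SHIFT = 10  # horizontal and vertical shift between patches


def four_patches(image):
    """Cut a 28 x 28 image (list of rows) into four overlapping patches."""
    offsets = [(0, 0), (0, SHIFT), (SHIFT, 0), (SHIFT, SHIFT)]
    return [
        [row[c:c + PATCH] for row in image[r:r + PATCH]]
        for r, c in offsets
    ]


def merge_activity(per_patch_activity):
    """Concatenate the per-patch mean activities into one feature vector."""
    return [a for patch in per_patch_activity for a in patch]


# Dummy image whose pixel value encodes its own position.
image = [[r * 28 + c for c in range(28)] for r in range(28)]
patches = four_patches(image)

# Dummy response: 324 excitatory neurons per patch.
activity = [[0.0] * 324 for _ in patches]
vector = merge_activity(activity)
print(len(patches), len(patches[0]), len(patches[0][0]), len(vector))
```

Because 10 + 18 = 28, the four patches together cover the whole image, with an overlap of 8 pixels in each direction.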

We measured the accuracy of both DCNNs by presenting the MNIST test set
and comparing their predictions with the known labels. As for the spiking
network, we measured the accuracy of five separately trained networks and report
the average accuracy here.

We also calculated the F-score for all models and levels of pixel erasement.
Because it shows no noticeable difference from the accuracy, it is not shown here.


2.4 Robustness Against Pixel Erasement 


In a previous study, Kermani et al. demonstrated that networks with
biologically motivated learning rules in combination with inhibitory synapses are more
robust against a loss of information in the input. They measured the
classification accuracy of their network for different levels of pixel erasement in the
MNIST dataset [17]. Following this approach, we erased pixels of all digits in
the MNIST test set in 5% steps, erasing only pixels with a value above zero (see
Fig. 1B). We created one data set per erasement level and showed each model
the same dataset. For each level of pixel erasement, we measured the number of
correct classifications as described above. Independently of the number of erased
pixels, the SVM was always fitted with the activity vectors measured on
the original training set.
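
The erasement procedure can be sketched as below. Two points are assumptions because the text leaves them open: the percentage is taken relative to the number of non-zero pixels of each digit, and an erased pixel is overwritten with a configurable `fill` value (zero by default). The function name and the fixed seed are likewise illustrative.

```python
import random


def erase_pixels(image, fraction, rng, fill=0):
    """Return a copy of a flat image with a share of its active pixels erased.

    Only pixels with a value above zero are candidates for erasement.
    """
    active = [i for i, value in enumerate(image) if value > 0]
    n_erase = round(fraction * len(active))
    erased = list(image)
    for i in rng.sample(active, n_erase):
        erased[i] = fill
    return erased


# Erasement levels in 5% steps, from the original image up to 95%.
LEVELS = [step * 0.05 for step in range(20)]

# Toy "digit": 10 background pixels followed by 20 active pixels.
digit = [0] * 10 + [255] * 20
degraded = erase_pixels(digit, 0.25, random.Random(0))
print(sum(1 for v in degraded if v > 0))  # 15 active pixels remain
```

Creating one degraded copy of the test set per level with a fixed seed gives every model the same degraded images, as required for the comparison.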


3 Results 


On the original MNIST test data set, our network achieved an average accuracy
of 98.08% over five runs. If inhibition is removed, 96.81% accuracy is achieved.
The LeNet 5 implementation achieved 99.24% and the VGG13 network 99.41%,
averaged over five runs (Table 1). Our results show that at 25% erased pixels
the spiking network achieves higher accuracy values than the LeNet 5 network,
but lower values than the VGG13 network. We deactivated inhibition and
measured the accuracy again on the different levels of pixel erasement. As mentioned by
Kermani et al., the accuracy decreases more strongly without inhibition than with it
(see Fig. 2A) [17].

Fig. 2. Classification accuracy as a function of the level of pixel erasement. A, Robustness
of our spiking network (blue line) is between the LeNet 5 (green line) and VGG13 (red
line) networks. Deactivation of inhibition leads to a less robust spiking network (dashed
blue line). B, The first layer of LeNet 5 (dashed green line) is more robust than the complete
LeNet 5, whereas the first layer of the VGG13 is less robust (dashed red line). (Color
figure online)

Our spiking network consists of only one layer of excitatory neurons. Because
of that, we also measured the accuracy of LeNet 5 and VGG13 using only the
activity of the first convolutional layer. For this purpose, the output of the first layer was
connected to a classification layer with 10 units and a softmax activation
function. Only the weights from the convolutional to the classification layer were trained on
the MNIST training set. The classification on the pixel-erased dataset was done
as for the other deep networks. With an accuracy of 98.1% for the first layer
of LeNet 5 and 97.29% for the first layer of the VGG13, the first convolutional
layer alone achieved a lower accuracy on the original MNIST test set than the
complete network (Table 1). Under stepwise pixel erasement, the first layer of
LeNet 5 is slightly more robust than the complete network. In contrast, the first layer
of the VGG13 model is less robust than the complete model. The course of the
curve is similar to that of our spiking network (Fig. 2B). The size of the receptive fields
in our spiking model does not correspond to the size of the convolutional kernel
in the DCNNs. Further, every feature map in the convolutional layer shares the
same convolutional kernel. Our spiking network learns 324 different receptive
fields. That would be equivalent to 324 different feature maps in a DCNN. To
account for these differences between the spiking approach and the DCNNs,
we changed the number of feature maps in the first convolutional layer and the
kernel size in the LeNet 5 and the VGG13 network to 9 × 9 and 18 × 18. To avoid
unnecessary computational load and the possibility of overfitting, we increased the
number of feature maps only to 64 and 96. The increased kernel size in the LeNet
5 implementation leads to a significant improvement (see Fig. 3A). However, for
the VGG13 model, it does not lead to a significant change (see Fig. 3B). An
increased number of feature maps in both DCNNs seems to have no effect on
robustness.

Fig. 3. Classification accuracy for different configurations of the DCNNs. A, LeNet 5
with different numbers of feature maps (brighter lines) and larger kernel sizes (darker
lines). B, VGG13 with different numbers of feature maps (brighter lines) and larger
kernel sizes (darker lines). More feature maps show no change or slightly less robustness
against pixel erasement. Larger kernel sizes lead to an improvement in LeNet 5.


Table 1. Accuracy values on deep convolutional networks LeNet 5 and VGG13, with 
different number of features and sizes for the kernel filter. Measured on original MNIST 
test set. Averaged over five runs per model. 


Architecture | Normal | First layer only | 64 features | 96 features | 9 × 9 kernel | 18 × 18 kernel
LeNet 5      | 99.24% | 98.10%           | 99.38%      | 99.42%      | 99.03%       | 98.77%
VGG13        | 99.41% | 97.29%           | 99.44%      | 99.43%      | 99.41%       | 99.32%


4 Discussion 


Our proposed two-layer spiking neural network (SNN) achieved an accuracy of
98.08% on the original MNIST data set. Previous SNNs trained with unsupervised
learning have shown slightly weaker results on the MNIST data set [16,22]. Diehl and
Cook [16] presented a two-layer SNN with an architecture similar to the one
proposed here. They achieved an accuracy of 95.0% with 6400 excitatory neurons
and an accuracy of 87.0% with 400 excitatory neurons. In contrast to our spiking
network, the excitatory population in their network is connected one-to-one to
the inhibitory one to implement a lateral inhibitory effect between the excitatory
neurons. Second, each neuron was connected to the full input of the MNIST
data set and thus learned complete digits as receptive fields. After learning, they
assigned every neuron a class, namely the class with the highest activity on
the training set [16]. The class of the most active neuron defined the prediction
of the network on the test set. Our network is trained on natural scene input [4]
instead of images of the MNIST data set. Because each neuron is presented with small
segments containing different spatial orientations and spatial frequencies, our network
learns Gabor-like receptive fields [7]. These feature detectors are selective for only
a part of the presented input instead of a complete digit. Furthermore, classification
in our approach is done by training a simple linear SVM with the activity vectors
of the excitatory population. Instead of considering only the activity of the
most active neuron, the classification here includes the activity of all excitatory
neurons. Therefore, different digits are decoded by the combination of different
neuronal activities. This leads to a better classification accuracy with a smaller
number of neurons. Another unsupervised spiking network was presented by
Tavanaei and Maida [23], consisting of four layers. Their input consists of overlapping
patches of 5 × 5 pixels, cut out of the MNIST training set. Every
pixel value determines the rate of the input spike train for the neurons in the
second layer. In the second layer, there are lateral inhibitory connections between
the neurons. This led to Gabor-like receptive fields in the second layer. The next
layer was a max-pooling layer, followed by a so-called “feature discovery” layer.
After learning in the second layer was finished, they trained the fourth layer. The
output of the fourth layer was used to train an SVM for the classification. They
used four SVMs with different kernels and averaged them. With 32 neurons in
the second and 128 neurons in the last layer, they achieved an accuracy of 98.36%
on the MNIST test set.

A deeper unsupervised spiking approach was presented by Kheradpisheh et 
al. [22]. They presented a deep spiking network to mimic convolutional and
max-pooling layers by using a temporal coding STDP learning algorithm. This means
that the first firing neuron learned most, while later firing neurons learned less
or nothing. Their network consists of three pairs of a convolutional and a
max-pooling layer. For classification, they used a linear SVM on the output of the last
pooling layer. On the MNIST data set, they achieved an accuracy of 98.4% [22].
Their temporal coding implements a “winner-takes-all” mechanism, which is less
biologically plausible than the learning rules used in our approach. Nonetheless,
the complex structure of the network from Tavanaei and Maida [23] and of the
Kheradpisheh et al. [22] network is evidence that unsupervised
STDP learning rules are feasible in a multi-layer network.

A comparison with two deep convolutional networks on stepwise pixel erasement
showed that our LeNet 5 implementation is less robust and the VGG13 model is
more robust than the spiking network proposed here (Fig. 2A). When accuracy is
measured only on the activity of the first layer, the LeNet 5 first layer is more
robust than the complete model. For VGG13, the first layer is less robust. The
first convolutional layers of both models have the same kernel size (3 x 3) and
number of features (32), but their robustness differs (Fig. 2B). The two deep
convolutional neural networks (DCNNs) have different numbers of layers and a
different order of convolutional and max-pooling layers. This suggests that the
structure of the network influences the learning result in the first convolutional
layer, especially how the error between output and input is back-propagated. In
contrast to an increase in the number of features, an increase in the convolutional
kernel size leads to an improvement in robustness (Fig. 3), but to a decrease in
the accuracy of the LeNet 5 model on the original data set (Table 1). An increase
in the number of features or the convolutional kernel size does not lead to a
significant change for the VGG13 model. With a larger filter kernel, the erasement
of a fixed number of pixels in the input has a lower influence on the activity of
the neurons. With a 3 x 3 kernel, three erased pixels in the input cause a loss of
33.33% of the incoming activity (3 of 9 inputs), whereas with a 9 x 9 kernel the
loss is only 3.7% (3 of 81 inputs).
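
The percentages quoted above follow from simple counting. The short sketch below reproduces them under two simplifying assumptions that are ours, not the paper's: all erased pixels fall inside the receptive field of the kernel, and every input pixel contributes equally to the incoming activity. The function name is hypothetical.

```python
def erased_fraction(kernel_size, erased_pixels):
    """Fraction of a square kernel's inputs that is lost.

    Assumes (for illustration only) that every erased pixel lies inside the
    receptive field and that all inputs contribute equally.
    """
    return erased_pixels / (kernel_size * kernel_size)


for k in (3, 9):
    share = 100 * erased_fraction(k, erased_pixels=3)
    print(f"{k} x {k} kernel: {share:.2f}% of the inputs erased")
```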

As mentioned in previous works [17], our results show that learned lateral
inhibition leads to an improvement of classification robustness against pixel
erasement in unsupervised neural networks. On the one hand, without inhibition
neurons lose the sharpening of their selectivity [6,8]. On the other hand, the
correlation between the neuron activities increases. This leads to a less distinct
input encoding, which in turn decreases the robustness against pixel erasement
[17]. The robustness of DCNNs is influenced by the learned feature maps, which
result from the back-propagation mechanism and the network architecture.
Further, a larger kernel size improves the robustness, whereas the number of
feature maps is less relevant. The absence of inhibition in DCNNs suggests that
the influence of inhibition on the neuronal activity is not the only factor that
improves robustness. Rather, filter size and the structure of the learned filters
are important for robust behaviour.
Abstract. In recent years, deep learning has surpassed human performance
in image recognition tasks. A major issue with deep learning systems is
their reliance on large datasets for optimal performance. When presented
with a new task, generalizing from small amounts of data becomes highly
attractive. Research has shown that the human visual cortex might employ
sparse coding to extract features from the images that we see, leading to
efficient usage of available data. To ensure good generalization and energy
efficiency, we create a multi-layer spiking convolutional neural network
which performs layer-wise sparse coding for unsupervised feature extraction.
It is applied to the MNIST dataset, where it achieves 92.3% accuracy with
just 500 data samples, which is 4x fewer than vanilla CNNs need for similar
accuracy, while reaching 98.1% accuracy with the full dataset. Only around
7000 spikes are used per image (a 6x reduction in transferred bits per
forward pass compared to CNNs), implying high sparsity. Thus, we show that
our algorithm ensures better sparsity, leading to improved data and energy
efficiency in learning, which is essential for some real-world applications.


Keywords: Sparse coding · Unsupervised learning · Feature extraction ·
Spiking neural networks · Training data efficiency


1 Introduction 


Deep learning [1] has been successfully used in recent times for computer vision,
speech recognition, natural language processing, and other similar tasks. The
availability of large amounts of data and the processing power of GPUs are vital
in training a deep neural network (DNN). The need for high processing power has
led to research on specialized hardware for deep learning and on algorithms that
can make use of such hardware. Spiking neural networks (SNNs) [2] are
brain-inspired networks which promise energy efficiency and higher computational
power compared to artificial neural networks. Information is communicated using
spikes and learning is done using local learning rules.

© Springer Nature Switzerland AG 2018 
In some real-world applications, such as recognizing a new language or exploring
a new environment, large datasets are initially unavailable. Extracting useful
information from the little available data becomes a major metric when comparing
algorithms for these tasks. Without enough data, DNNs fail to generalize, and
hence their performance on unseen data is poor. Gathering large amounts of data
is difficult, and training on it carries a high energy cost. Attempts towards
human-like learning, which is mostly unsupervised and can generalize from a few
examples, are being made [3] to solve the data availability issue. On the other
hand, when a large amount of data is available as part of a standard dataset, the
performance of SNNs is not on par with state-of-the-art DNNs. Thus, a critical
goal is to learn in an energy-efficient manner from small data while obtaining
comparable results on larger datasets.

To achieve this goal, we use an improved sparse coding algorithm to train
a multi-layer SNN layer-wise, in an unsupervised manner. Motivated by the visual
cortex of animals, sparse coding [4] can lead to efficient feature extraction, as
shown in Fig. 1. When patches of the input image are given as input, basis vectors
are learnt which can reconstruct the input. The learnt features are then passed to
a layer trained in a supervised fashion for classification, which allows the
quality of the extracted features to be quantified in terms of the accuracy
obtained.


[Figure 1 panels: Input Image; Learnt Basis]


Fig. 1. Overview of sparse coding. A basis was learnt to efficiently reconstruct all 
patches of the input image. An example of how a patch is sparsely reconstructed using 
basis filters is shown. 
In this paper, the learning rules used to train the network are inspired by
SAILnet [5], which has been shown to perform sparse coding in SNNs. We modify the
SAILnet learning rules to improve the quality of the extracted features and to
promote higher sparsity which, in turn, is seen to improve the prediction
accuracy. We then show that our network learns better than a vanilla CNN when
only a small amount of data is given, while using local learning rules and being
more efficient in terms of the energy required for a forward pass. Such
performance is highly relevant in applications like the Internet of Things (IoT)
or autonomous and mobile systems, where large labeled datasets are unavailable
and energy is limited.


2 Background and Related Work 


Spiking neural networks are becoming popular due to their energy efficiency,
but they have not reached the accuracy levels given by DNNs. A possible reason
for this is the lack of a general learning rule similar to backpropagation.
Spike-timing-dependent plasticity (STDP) [6], training a DNN using
backpropagation and transferring the weights to a SNN, and sparse coding [4] are
a few of the methods that have been tried to train SNNs.

Olshausen and Field [4] showed that sparse coding with an overcomplete
basis leads to learning of filters which are similar to those found in the visual
cortex of animals. Given a basis (dictionary), a class of algorithms called
locally competitive algorithms (LCA) can be used to find the optimal sparse
coefficients [7]. Further, it was proved in [8] that a SNN with lateral inhibition
solves a constrained LASSO problem and learns the optimal sparse coefficients. An
algorithm to learn the dictionary in SNNs, called Sparse And Independent Local
network (SAILnet), was proposed in [5]. Filters learnt using this algorithm were
similar in shape to those found in biology. SAILnet is used to train one layer of
convolutional filters in [9], which is extended to multiple layers in our work.

In contrast to our approach, which involves rate coding, i.e., information
is coded as the rate of spiking, there exist various other examples of sparse
coding using STDP, which is essentially temporal coding, i.e., information is
coded in the exact time of spiking. STDP with hard lateral inhibition is used in
[10] for unsupervised layer-wise training of a spiking CNN, followed by a SVM for
classification. A similar architecture, with a simplified STDP rule and a
winner-takes-all (WTA) mechanism, is used in [11] to train the CNN, with a
multi-layer perceptron used for classification. A fully unsupervised learning
approach using SNNs is given in [12], where training is done using STDP and
accuracy is calculated based on the response of neurons. A non-local, gradient
descent type learning rule is used in [13] to train individual layers of a
multi-layer SNN similar to auto-encoders.

All the above examples use temporal coding, whereas our approach uses
rate-based learning rules to enable rate coding. Rate coding is easier to
implement in hardware and is robust to noise, since the exact temporal structure
of spiking is not relevant and only the rate of spiking matters. Certain sensory
and motor neurons are found to use rate coding, which lends it biological
support.


3 Network Architecture and Learning Rules 


Our network architecture consists of multiple convolutional layers, each followed
by a max pooling layer. The last max pooling layer is followed by a fully
connected layer for classification. Each convolutional layer consists of spiking
neurons performing sparse coding as explained in Sect. 3.1. Figure 2 shows the
architecture of our network for the MNIST dataset, which was chosen based on the
best accuracy obtained during experiments. We use two convolutional layers, one
with 12 filters of size 5 x 5 and the other with 64 filters of size 5 x 5, both
with a stride of 1. The max pooling filters following both layers are of size
2 x 2. The fully connected layer is a single-layer artificial neural network.
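
The feature-map sizes implied by this architecture can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes unpadded ("valid") convolutions and non-overlapping pooling windows, which is consistent with the 12@24x24, 12@12x12 and 64@4x4 labels in Fig. 2 but is not stated explicitly in the text. All function names are hypothetical.

```python
def conv_out(size, kernel, stride=1):
    """Output width of an unpadded convolution (assumed, see lead-in)."""
    return (size - kernel) // stride + 1


def pool_out(size, window):
    """Output width of non-overlapping max pooling (assumed)."""
    return size // window


def layer_shapes(input_size=28, layers=((12, 5), (64, 5)), pool=2):
    """Return (kind, n_filters, height, width) for every conv/pool stage."""
    shapes = []
    size = input_size
    for n_filters, kernel in layers:
        size = conv_out(size, kernel)
        shapes.append(("conv", n_filters, size, size))
        size = pool_out(size, pool)
        shapes.append(("pool", n_filters, size, size))
    return shapes


shapes = layer_shapes()
for kind, n, h, w in shapes:
    print(f"{kind}: {n}@{h}x{w}")

# Number of inputs seen by the fully connected classification layer.
_, n, h, w = shapes[-1]
n_features = n * h * w
print("features passed to the classifier:", n_features)
```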


[Figure 2 diagram labels: Input 28x28; 5x5 convolution with forward weights (Q)
and inhibitory weights (W); 12@24x24; 2x2 max pooling; 12@12x12; 5x5 convolution;
max pooling; 64@4x4; fully connected]

Fig. 2. Network architecture. 


3.1 Spiking Neural Network 


Our network uses spiking neurons in each convolutional layer. Since weights in a
CNN are shared between receptive fields, for the purpose of this discussion
consider a single patch of the image as the input and the SNN to be fully
connected, with the number of outputs equal to the number of convolutional
filters. For the first layer, a current proportional to the intensity of the input
image is multiplied by the forward weights ($Q^{(1)}$) and passed as input to the
neurons. For subsequent layers, a current proportional to the firing rate of the
neurons of the previous layer is multiplied by the corresponding forward weights
($Q^{(l)}$) and passed as input to the next layer. The neurons integrate the
current and fire a spike on reaching a threshold ($\theta$). When a neuron spikes,
the other output neurons are inhibited through a negative current proportional to
the inhibitory weights ($W^{(l)}$).

Mathematically, each neuron is leaky integrate-and-fire, maintaining an internal
variable $V_i^{(l)}(t)$, which is updated as

$$V_i^{(l)}(t+1) = (1-\eta)\,V_i^{(l)}(t) + \eta\Big(\sum_k X_k^{(l)} Q_{ki}^{(l)} - \sum_j W_{ij}^{(l)} a_j^{(l)}(t)\Big), \qquad (1)$$

where $X^{(l)}$ is the input to the $l^{th}$ layer, $a_j^{(l)}(t)$ indicates
whether neuron $j$ spiked in the previous time step, and $\eta$ (set to 0.1 in our
experiments) is a parameter controlling the rate of decay of the internal
variable. For each presentation of the input, the SNN is simulated for 50 time
steps and the rate of spiking ($n_i^{(l)}$) is given by the number of spikes
divided by 50.
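
A minimal sketch of this simulation for a single input patch is given below. It is an illustration rather than the authors' code: the leaky update with decay parameter 0.1, the 50 time steps and the definition of the firing rate follow the description above, whereas resetting the internal variable to zero after a spike and starting from zero are assumptions. The function and argument names are hypothetical.

```python
def simulate_snn(x, Q, W, theta, eta=0.1, steps=50):
    """Simulate one layer of leaky integrate-and-fire neurons on one patch.

    x     : list of K input values
    Q     : K x N forward weights, Q[k][i] connects input k to neuron i
    W     : N x N inhibitory weights, W[i][j] is the inhibition of i by j
    theta : firing threshold
    Returns the firing rate of each neuron (spike count / steps).
    """
    n_out = len(W)
    # Feed-forward drive is constant while the patch is presented.
    drive = [sum(x[k] * Q[k][i] for k in range(len(x))) for i in range(n_out)]
    v = [0.0] * n_out       # internal variables, assumed to start at zero
    spiked = [0] * n_out    # spikes emitted in the previous time step
    counts = [0] * n_out
    for _ in range(steps):
        inhibition = [sum(W[i][j] * spiked[j] for j in range(n_out))
                      for i in range(n_out)]
        v = [(1 - eta) * v[i] + eta * (drive[i] - inhibition[i])
             for i in range(n_out)]
        spiked = [1 if v[i] >= theta else 0 for i in range(n_out)]
        for i in range(n_out):
            if spiked[i]:
                counts[i] += 1
                v[i] = 0.0  # assumption: reset after a spike
    return [c / steps for c in counts]


# Two neurons receiving the same drive, without and with lateral inhibition.
free = simulate_snn([1.0], [[1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]], theta=0.5)
inhibited = simulate_snn([1.0], [[1.0, 1.0]], [[0.0, 0.0], [10.0, 0.0]], theta=0.5)
print("rates without inhibition:", free)
print("rates with neuron 0 inhibiting neuron 1:", inhibited)
```

In the second call, neuron 0 suppresses neuron 1 every time it spikes, so the firing rate of neuron 1 drops, which is the mechanism by which lateral inhibition produces sparse activity.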


3.2 Convolution and Max Pooling 


To perform the convolution operation, we divide the input into patches which
are to be passed through the convolutional filters. Each patch is then passed as
input to the SNN. The SNN is simulated and the resulting firing rates are used as
inputs to the next layer. The SNN simulations corresponding to individual patches
can be performed in parallel since they are independent of each other.

The max pooling layer simply picks the neuron with the highest firing rate in its
receptive field.
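
The two operations described in this subsection can be sketched as follows. This is a plain-Python illustration under the assumptions of a single-channel input, unpadded patches and non-overlapping pooling windows; the helper names are hypothetical.

```python
def extract_patches(image, k, stride=1):
    """Split a 2-D image (list of rows) into flattened k x k patches.

    Returns a grid patches[r][c] in which every entry is a list of k*k
    pixel values that would be presented to the SNN as one input.
    """
    h, w = len(image), len(image[0])
    grid = []
    for r in range(0, h - k + 1, stride):
        row = []
        for c in range(0, w - k + 1, stride):
            row.append([image[r + dr][c + dc]
                        for dr in range(k) for dc in range(k)])
        grid.append(row)
    return grid


def max_pool(rate_map, window=2):
    """Keep the highest firing rate in each non-overlapping window."""
    h, w = len(rate_map), len(rate_map[0])
    return [[max(rate_map[r + dr][c + dc]
                 for dr in range(window) for dc in range(window))
             for c in range(0, w - window + 1, window)]
            for r in range(0, h - window + 1, window)]


image = [[4 * r + c for c in range(4)] for r in range(4)]   # toy 4 x 4 input
patches = extract_patches(image, k=3)
pooled = max_pool(image, window=2)
print(len(patches), "x", len(patches[0]), "patches of length", len(patches[0][0]))
print(pooled)
```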


3.3 Training 


Training is done layer-wise, i.e., a layer is fully trained and its weights are
frozen before training the next layer. While our algorithm allows training all
layers together, we found that the quality of the extracted features was worse
than with layer-wise training.

For training a layer, a SNN corresponding to that layer is first initialized
and the convolution operation is performed. Based on the firing rates of the
output neurons, the weights of the SNN are updated according to the rules given in
Sect. 3.4. The new weights are used while simulating the SNN for future inputs.
This cycle of simulating the SNN and updating the weights is repeated for a given
number of input presentations. Once enough images have been presented, the SNN
weights are frozen and the firing rates corresponding to each patch of the images
are used as input to the next layer.

When all convolutional layers are trained, the output of the final max pooling
layer is used as input to a fully connected artificial neural network which is
trained to classify the dataset.


3.4 Learning Rules 


The learning rules used to update the weights of the SNN are inspired by SAILnet
[5] and LCA [8] and lead to solving the sparse coding problem.

Sparse coding tries to represent a given input using a set of overcomplete basis
vectors such that the components of the input in this new basis are sparse
(as close to zero as possible). Mathematically, it involves solving the following
optimization problem:

$$\min_{n_i^{(j)},\, Q_i} \sum_{j=1}^{m} \left( \Big\| x^{(j)} - \sum_i n_i^{(j)} Q_i \Big\|^2 + \lambda\, S\big(n^{(j)}\big) \right), \qquad (2)$$

where $x^{(j)}$ represents the $j^{th}$ input sample, $Q_i$ are the basis vectors
and $n_i^{(j)}$ are the coefficients corresponding to the $j^{th}$ input sample,
which can be used to reconstruct the input as $x^{(j)} = \sum_i n_i^{(j)} Q_i$. In
our work, we have taken the sparsity penalty $S(\cdot)$ to be the L1 norm.

The original SAILnet implementation updates the weights ($Q$, $W$) as well as
the firing threshold of the neurons ($\theta$) to solve the sparse coding problem.
It uses a hyperparameter $p$, which is kept equal to a low value, to represent the
target firing rate. We modify the SAILnet learning rules in our implementation as
given in Table 1.

We keep $\theta$ constant, similar to LCA, as opposed to updating it as in
SAILnet. This allows simpler neurons without varying thresholds to be used in
the network. With $\theta$ constant, the $-p^2$ term in the update for $W$ is no
longer meaningful, and empirically we found that our modified rules gave an
improvement in the final accuracy of the network.


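Table 1. Comparison of SAILnet learning rules and our modification.


Original SAILnet                                  | Our modification
$\Delta Q_{ki} = \beta n_i (x_k - n_i Q_{ki})$    | No modification
$\Delta W_{ij} = \alpha (n_i n_j - p^2)$          | $\Delta W_{ij} = \alpha n_i n_j$
$\Delta \theta_i = \gamma (n_i - p)$              | $\theta$ is constant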


Updating $Q$ ensures correct reconstruction of the input, while updating $W$
ensures that the firing rates of the neurons are independent. Due to lateral
inhibition in the architecture, $W$ also leads to sparsity.

The gradient of the cost function with respect to $Q_{ki}$ is proportional to
$n_i (x_k - \sum_j n_j Q_{kj})$, but it is shown in [5] that this gradient can be
approximated as above to make the learning rule local, without much loss in
reconstruction error.
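
The modified rules of Table 1 can be written as a short update routine. The sketch below is an illustration under stated assumptions, not the authors' implementation: the forward-weight rule uses only the input, the rate of the receiving neuron and the weight itself (the local approximation), and the inhibitory rule grows with the product of two firing rates. Skipping self-inhibition (i = j) is an assumption carried over from SAILnet, and the function name is hypothetical.

```python
def update_weights(x, n, Q, W, alpha, beta):
    """Apply one step of the modified learning rules in place.

    x : list of K inputs of one patch
    n : list of N firing rates obtained for that patch
    Q : K x N forward weights, Q[k][i]
    W : N x N inhibitory weights, W[i][j]
    """
    for k in range(len(x)):
        for i in range(len(n)):
            # Delta Q_ki = beta * n_i * (x_k - n_i * Q_ki)
            Q[k][i] += beta * n[i] * (x[k] - n[i] * Q[k][i])
    for i in range(len(n)):
        for j in range(len(n)):
            if i != j:  # assumption: no self-inhibition
                # Delta W_ij = alpha * n_i * n_j (threshold stays constant)
                W[i][j] += alpha * n[i] * n[j]
    return Q, W


Q = [[0.0, 0.0], [0.0, 0.0]]
W = [[0.0, 0.0], [0.0, 0.0]]
update_weights(x=[1.0, 0.0], n=[0.5, 0.2], Q=Q, W=W, alpha=10, beta=0.1)
print(Q)
print(W)
```

Neurons that are active together strengthen their mutual inhibition, which discourages them from firing together on later inputs.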


4 Experiments 


This section describes the experimental setup, training method and the results 
obtained. We use MNIST dataset to evaluate our network. MNIST dataset con- 
sists of 60,000 training images and 10,000 test images of handwritten digits from 
0 to 9. All images are grayscale and 28 x 28 in size. 


4.1 Comparison of Learning Rules Using Fully Connected SNN 


First, we check whether the modifications made to the SAILnet learning rules lead
to better performance of the network. To compare the quality of the features
extracted with our learning rule and with the SAILnet baseline, we created a SNN
which took whole MNIST images as the input and performed sparse coding using 25
output neurons. A fully connected SNN is used since the difference between
convolutional filters is hard to see visually.

Figures 3a and b show the filters learnt. Ideally, the filters should look like
different digits since they are used to reconstruct MNIST images. But in the
SAILnet case, there are many filters which are a mixture of digits, and not all
digits are represented. With our modified learning rule, such mixed digits are
significantly reduced and the diversity of shapes in the filters is increased. We
believe this is because SAILnet updates $\theta$ such that the firing rates are
equal to a low value $p$, instead of ideally being close to zero, which drove some
filters to learn redundant features for simple datasets like MNIST. Sparsity also
improved with our modification, with an average of 65 spikes needed per image
compared to 85 spikes when trained with the original SAILnet rules.

Fig. 3. Filters learnt by using (a) SAILnet rules and (b) our modification (Sect. 4.1).
All 10 digits are represented when using our modification as compared to 8 digits with
SAILnet. The marked filters are redundant since they are a mixture of multiple digits.


4.2 Comparing Learning with Varying Data Size 


A random subset of the data, of size varying from 500 to 50000, is taken for
training and validation. 75% of it is used for training and the rest is kept for
validation. Both training and validation data are used to learn the SNN weights in
an unsupervised manner. The training data is further used to train the supervised
layer, while the validation data is used to adjust the hyperparameters of the
network. Accuracy is reported on an unseen subset of size 10000 and compared
against two baselines. The first baseline randomly initializes the SNN, freezes
its weights and trains only the supervised layer (MLP baseline). This baseline
shows the usefulness of the features extracted by the convolutional layers. The
second baseline is a vanilla CNN with the same architecture but trained using
backpropagation (CNN baseline). In all cases, no pre-processing or data
augmentation is done.
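
The data split described above can be sketched as follows. The 75%/25% split and the subset sizes come from the text; drawing the subset from the 60,000 MNIST training images, the fixed seed and the function name are assumptions made for illustration.

```python
import random


def split_subset(n_total, subset_size, train_fraction=0.75, seed=0):
    """Draw a random subset of sample indices and split it into
    training and validation indices."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    indices = rng.sample(range(n_total), subset_size)
    n_train = int(train_fraction * subset_size)
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_subset(n_total=60000, subset_size=500)
# Both parts feed the unsupervised SNN training; only train_idx is used for
# the supervised layer, and val_idx is used for hyperparameter tuning.
unsupervised_idx = train_idx + val_idx
print(len(train_idx), len(val_idx), len(unsupervised_idx))
```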

For training data of size 500, $\alpha = 10$ and $\beta = 0.1$ are used for the
SNN weight updates and $\theta = 0.005$ is taken as the firing threshold of the
neurons. A batch size of 100 is used and training is done for 1000 epochs. The
number of epochs is scaled to keep the effective number of updates the same as the
data size increases. The supervised layer and the CNN baseline use the Adam
optimizer with a learning rate of 0.001 for backpropagation, with the same batch
size and number of epochs.


Accuracy. Figure 4 shows the classification error as a function of training data
size. Our method reaches 92.3% accuracy with 500 samples, increasing to 95.6% with
3000 samples and 97.7% with 30000 samples. With the full dataset, the obtained
accuracy is 98.1%. It can be seen from the figure that our method performs
significantly better than the baselines with small data, while becoming only
slightly worse than the CNN baseline as the data size increases.

Regenerative learning [13] outperforms the CNN baseline but is worse than our
method below 10000 data samples. It also has the additional disadvantage of
needing a non-local learning rule and requiring the internal variable of the
neuron in the weight updates. References [9-11] report accuracies of 98.36%,
98.4% and 98.49%, respectively. Our method comes close to those values.


270 V. Bhatt and U. Ganguly

Fig. 4. Classification error vs data size for MNIST data set. SNN is not trained for
MLP baseline. Backpropagation is used to train CNN in CNN baseline.


Sparsity. The learnt SNN weights promote sparsity and independence in the
firing rates of the neurons. The neurons in all layers combined spike only about
7000 times per image on average. A lower number of spikes denotes efficient
information transfer and also low energy usage if the network is implemented in
hardware. Figure 5a shows the distribution of the firing rates of the neurons. It
can be observed that most neurons fire less than twice during a SNN simulation.
Figures 5b and c show the average correlation between the firing rates of neurons
for the first and second layer, respectively. Near-zero values of the off-diagonal
elements show that the firing rates are almost independent. The mean
reconstruction error in the first layer is 2.5, compared to 75.6 before training
the SNN weights.

We hypothesize that sparsity plays a major role in being able to generalize
with little data. Since the inhibitory weights indirectly control the amount of
sparsity and $\alpha$ controls the rate of increase of the inhibitory weights,
reducing $\alpha$ reduces sparsity. With 500 data samples, the network is trained
with various values of $\alpha$, keeping everything else constant. Figure 6 shows
the accuracy and the average number of spikes per image as a function of $\alpha$,
and it can be observed that lower sparsity indeed reduces the accuracy of the
network.

Fig. 5. (a) Distribution of firing rates corresponding to all images. Most of the neurons
spike at most once per image presentation. (b), (c) Average correlation between firing
rates of neurons for first and second layer respectively. Off-diagonal elements are close
to zero showing independence in firing rates.

Fig. 6. Classification error and average spikes per image vs $\alpha$. Lower $\alpha$
implies lower inhibition and hence lower sparsity, which leads to more errors in
classification.


To obtain a rough estimate of the advantage of the spiking architecture and of
sparsity for energy efficiency, we consider the number of bits that need to be
transmitted during the forward pass of an image. Since a spike can be represented
using 1 bit, our network uses an average of 7000 bits per image. The vanilla CNN
baseline that we use needs to transmit around 1300 non-zero floating point numbers
per image, translating to nearly 42000 bits, which is about 6 times more than our
network uses.
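
The bit-count comparison can be reproduced with the figures quoted in the text. The sketch assumes 32-bit floating point numbers, which is what the "nearly 42000 bits" for about 1300 values suggests but is not stated explicitly; the function names are hypothetical.

```python
def snn_bits(spikes_per_image, bits_per_spike=1):
    """Bits transmitted by the spiking network for one image."""
    return spikes_per_image * bits_per_spike


def cnn_bits(nonzero_activations, bits_per_float=32):
    """Bits transmitted by the CNN baseline (32-bit floats assumed)."""
    return nonzero_activations * bits_per_float


spiking = snn_bits(7000)
baseline = cnn_bits(1300)
ratio = baseline / spiking
print(f"SNN: {spiking} bits, CNN: {baseline} bits, ratio: {ratio:.2f}")
```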


5 Discussions 


The data and spike efficiency of our algorithm translates directly to energy
efficiency when implemented in hardware. Examples of custom hardware for training
SNNs are available in the literature. Implementing energy-efficient algorithms in
such custom SNN hardware is a promising approach to mobile, autonomous systems
and IoT applications. Training and inference in such systems can be further
optimized if the classification layer is also spiking-based and implemented in
similar hardware.

6 Conclusions


In this paper, we present a method to train multi-layer spiking convolutional
neural networks in a layer-wise fashion using sparse coding. We modify the
learning rules given by SAILnet to improve the quality of the features extracted.
These learning rules, combined with the training method, are observed to give
better accuracy than a vanilla CNN architecture when using small data. 92.3%
accuracy is achieved with just 500 MNIST data samples, which is 4x fewer than
vanilla CNNs need for similar accuracy. The network also transfers information
between layers efficiently, using only 7000 spikes on average to represent
a MNIST image, a 6x reduction in the number of bits compared to the CNN baseline.
Such a data- and spike-efficient algorithm can enable energy efficiency for
mobile, autonomous systems and IoT applications.

Abstract. Bio-inspired energy efficient control is a frontier for autonomous
navigation and robotics. Binary input-output neuronal logic gates have been
demonstrated in the literature, whereas analog input-output logic gates are needed
for continuous analog real-world control. In this paper, we design logic gates such
as AND, OR and XOR using networks of Leaky Integrate-and-Fire neurons with
analog rate (frequency) coded inputs and output, where the refractory period is
shown to be a critical knob for neuronal design. To demonstrate our design
method, we present contour tracking inspired by the chemotaxis network of the
worm C. elegans and demonstrate for the first time an end-to-end Spiking
Neural Network (SNN) solution. First, we demonstrate contour tracking with an
average deviation equal to that reported in the literature with non-neuronal logic
gates. Second, a 2x improvement in tracking accuracy is enabled by implementing
latency reduction, leading to state-of-the-art performance with an average
deviation of 0.55% from the set-point. Third, a new feature of local extrema escape
is demonstrated with an analog XOR gate, which uses only 5 neurons, better than
binary logic neuronal circuits. The XOR gate demonstrates the universality of our
logic scheme. Finally, we demonstrate the hardware feasibility of our network based
on experimental results on 32 nm Silicon-on-Insulator (SOI) based artificial
neurons with tunable refractory periods. Thus, we present a general framework
of analog neuronal control logic along with the feasibility of its implementation
in a mature SOI technology platform for autonomous SNN navigation
controller hardware.


Keywords: Spiking Neural Network · Motor control · Neuromorphic computing


1 Introduction 


Spiking Neural Networks (SNNs) are third generation Artificial Neural Networks that
attempt to model neurons as computing units with underlying temporal dynamics that
resemble the spiking nature of biological neurons. While SNNs have been used to
solve a variety of problems in classification and regression, an equally intriguing aspect
is the implementation of control in a natural setting that could serve the dual purpose of
(i) demystifying complex biological behavior and (ii) inspiring efficient robotics
applications. Chemotaxis in Caenorhabditis elegans (C. elegans) is an example of such
a biological behaviour which requires control. C. elegans is a free-living nematode
which can sense a large number of chemicals including NaCl. This ability allows these
worms to find and subsequently move along a set point in the chemical concentration
space so as to locate food sources. Typically, the sensory neurons ASEL and ASER
provide chemical gradient information [4], and this information is used by interneurons
in the worm to decide the direction in which to move next to reach the chemical
set-point. The output of this computation is fed to motor neurons, which actuate the
worm's motion. Santurkar et al. [3] proposed a SNN model for chemical contour tracking
inspired by C. elegans. They demonstrated the superiority of spiking architectures over
non-spiking models and their tolerance to noise using the biologically realistic model of
sensory neurons proposed in [4]. However, the inter-neuronal operations required to
drive the motor neurons were computed without using neural circuits. Instead, an
artificial mathematical computation was used. Hence the SNN does not perform
integrated, end-to-end control of all three stages of computation, i.e., at the
(i) sensory neuron, (ii) interneuron and (iii) motor neuron levels. Such external
control is neither biologically realistic nor energy-area efficient. Further, more
sophisticated and realistic behaviour, such as escaping a local extremum, without
which the worm fails to reach the desired concentration over arbitrary concentration
landscapes, has not been demonstrated in neuronal circuits.

Existing SNN based logic gates [11-14] encode binary logic values using fixed 
spiking frequencies (low/high). But, the output spiking frequency of the gate should 
vary proportionately to one or more input spiking frequencies so that the worm turns in 
proportion to the urgency of sensory signals. This motivates the design of analog rate 
coded logic gates. 

In this paper, first, we implement analog rate-coded logic gates (AND, OR, XOR)
by designing neuronal responses using refractory periods. Second, we integrate AND
and OR with the sensory and motor neurons to demonstrate end-to-end control in the
chemotaxis network. Third, we incorporate an additional sub-network using the XOR
gate to escape a local extremum. The XOR, being a universal gate, also enables random
logic circuit implementation. Our design enables a reduced number of neurons for logic
gates, which leads to lower response latency (measured between sensory input and
motor neuron output), critical for many control applications. Fourth, we modify the
response of the sensory neurons proposed in [3] to reduce response latency and enable
significantly improved tracking compared to the state of the art. Finally, a hardware
neuron with a configurable refractory period is demonstrated on a mature 32 nm
silicon-on-insulator CMOS technology.


2 Network Architecture 


Figure 1 shows the proposed SNN architecture for chemotaxis in C. elegans. All the 
neurons in our network are Leaky Integrate and Fire neurons. The following sections 
will discuss the functional role of all the neurons used in this network. 

Fig. 1. Block diagram of the proposed SNN for contour tracking. N1, N2, N3 and N4 are sensory
neurons, receiving input from the concentration sensor. N5, N6, N7, N12 are motor neurons.
N8, N9, N10, N11 are the interneurons used to implement the XOR sub-network. The spiking
frequency of N12 is the XOR of the spiking frequencies of N5 and N6.


2.1 Turning Left and Right: The AND Sub-network 


2.1.1 Sensory Neurons 

As shown in Fig. 1, the neurons N1 and N2 are threshold detectors. N1 fires when the
current concentration (C) is greater than the set-point $C_T$, i.e. C > $C_T$, while N2
fires when C < $C_T$. A hard threshold (ideally a step function) is compared to the
soft-threshold response of N1 in Fig. 2. N3 and N4 are gradient (dC/dt) detectors,
firing respectively for positive (dC/dt > 0) and negative (dC/dt < 0) changes in
concentration. The input to all four sensory neurons, at each time step (t), is the
concentration at the current location of the worm. The equations for the ionic currents
that implement the required responses have been relegated to the appendix.


2.1.2 Motor Neurons 

The target for the worm is to reach the desired concentration set-point $C_T$. If at a
particular time instant the worm detects dC/dt > 0 (i.e. N3 spikes) and C > $C_T$ (i.e.
N1 spikes), the worm infers that it is moving away from $C_T$ and hence tries to turn
around. In this case the worm turns right by 3° and moves forward at a velocity of
0.01 mm/s, with the rate of turning being proportional to dC/dt. The motor neuron N5
encodes this command. Hence, the spiking frequency of N5 has to be the output of an AND
operation over the spiking frequencies of N1 and N3, i.e. N5 = AND(N1, N3). Bold
face is used to denote the spiking frequency of the corresponding neuron. Similarly, the
motor neuron N6 spikes if dC/dt < 0 (i.e. N4 spikes) and C < $C_T$ (i.e. N2 spikes). In
this case, the worm turns left by 3° at a velocity of 0.01 mm/s and the motor neuron N6
encodes this command, i.e. N6 = AND(N2, N4). When dC/dt > 0 and C < $C_T$, or
dC/dt < 0 and C > $C_T$, the worm infers that it is moving towards $C_T$ and hence keeps
moving forward at a constant velocity without turning.
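
The turning logic of this subsection can be summarised in a few lines. The sketch below is a deliberately simplified, non-spiking illustration: it uses Boolean conditions in place of rate-coded neurons and ignores that the rate of turning is proportional to dC/dt. The function name, argument names and returned labels are hypothetical.

```python
def motor_command(c, dc_dt, c_set):
    """Return the command selected by the AND sub-network logic.

    c      : concentration at the current location
    dc_dt  : rate of change of the concentration along the path
    c_set  : desired concentration set-point
    """
    if c > c_set and dc_dt > 0:
        return "turn_right"   # N5 = AND(N1, N3): above set-point, still rising
    if c < c_set and dc_dt < 0:
        return "turn_left"    # N6 = AND(N2, N4): below set-point, still falling
    return "forward"          # already heading towards the set-point


for c, dc_dt in [(1.2, 0.1), (0.8, -0.1), (1.2, -0.1), (0.8, 0.1)]:
    print(c, dc_dt, motor_command(c, dc_dt, c_set=1.0))
```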


2.1.3 Design Principles for the AND Sub-network 
Under the rate-coded approximation (which implies that injected current is assumed to 
be proportional to the spiking frequency), a neuron fires if the sum of the input spiking 
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frequencies (f;) multiplied by the corresponding synaptic weights (w;) is greater than fin. 
Hence the general firing condition for a neuron with k pre-synaptic neurons is: 


k 


Yo wii > fa (1) 


i=l 


Without a refractory period, the spiking frequency of N3 varies linearly with the
observed gradient. This gradient can be very large in some parts of the environment,
leading to very high spiking frequency, which in turn can make N5 spike by itself even
if N1 is not spiking. The saturation in the responses of N1 and N3 ensures that N5 only
fires when both N1 and N3 fire and hence acts as an AND gate. We choose w_1 and w_3
such that w_1 f_1,max = w_3 f_3,max = f_m, where f_1,max and f_3,max are respectively the maximum
spiking frequencies of N1 and N3, and f_m is some value close to, but smaller than, f_th.
We choose f_m to be close to f_th so that even a small value of f_3 will lead to N5 spiking,
hence ensuring the control circuit’s sensitivity to very small gradients as well.
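
The weight choice above can be checked numerically. The sketch below applies the firing condition of Eq. (1) to N5 with two inputs. The `margin` parameter (how far f_m sits below f_th) and all numeric values are assumptions chosen for illustration; the paper does not state them.

```python
def n5_fires(f1, f3, f1_max, f3_max, f_th, margin=0.05):
    """Rate-coded firing condition for N5 with inputs from N1 and N3.

    f_m is placed slightly below f_th (the margin is an assumed value),
    and the weights satisfy w1 * f1_max == w3 * f3_max == f_m.
    """
    f_m = (1.0 - margin) * f_th
    w1 = f_m / f1_max
    w3 = f_m / f3_max
    return w1 * f1 + w3 * f3 > f_th

# Saturated N1 alone or saturated N3 alone stays below threshold.
print(n5_fires(50, 0, 50, 50, 100))
print(n5_fires(0, 50, 50, 50, 100))
# Saturated N1 plus a modest N3 rate crosses the threshold.
print(n5_fires(50, 5, 50, 50, 100))
```

Shrinking `margin` moves f_m closer to f_th, so a smaller N3 frequency is enough to trigger N5, which is the sensitivity argument made in the text.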




Fig. 2. (a) A typical LIF neuron (blue dashed line) has almost a rectified linear unit (ReLU)
behaviour where the slope decreases with the membrane time constant τ_RC of the LIF neuron.
Adding a refractory period (τ_ref) limits the maximum frequency (f_max) to 1/τ_ref. (b) N1 neuron
f(C) behaviour, where it fires when C exceeds Cr. A softer threshold initiates spiking before the
hard threshold. (c) N3 neuron f(dC/dt) behaviour, with a spike frequency proportional to dC/dt
for dC/dt > 0. Both N1 and N3 have f_max < f_crit, such that neither can individually cause N5 to
spike; they need to fire together to cause N5 to fire, enabling the analog AND operation where
N5 fires proportionally to N3 only if N1 also fires. Otherwise, N5 does not fire. (Color figure
online)
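
The saturation described in Fig. 2(a) follows from the standard textbook rate formula for a LIF neuron driven by a constant current. The sketch below uses that formula; the parameter values (`i_th`, `tau_rc`, `t_ref`) are illustrative assumptions and are not taken from the paper.

```python
import math

def lif_rate(current, i_th=1.0, tau_rc=0.02, t_ref=0.0):
    """Steady-state firing rate (Hz) of a LIF neuron for a constant input.

    Textbook formula: the inter-spike interval is the refractory period
    plus the time the membrane needs to charge from reset to threshold.
    """
    if current <= i_th:
        return 0.0                      # sub-threshold input: no spikes
    charge_time = tau_rc * math.log(current / (current - i_th))
    return 1.0 / (t_ref + charge_time)

for i in (0.5, 2.0, 10.0, 1e6):
    print(i, lif_rate(i), lif_rate(i, t_ref=0.01))
```

Without a refractory period the rate keeps growing with the input, while with `t_ref = 0.01` s it approaches but never exceeds 1/t_ref = 100 Hz.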


The responses of the threshold detectors (N1 and N2) were taken as step functions in
[3] with the transition at the desired set-point. This discontinuous response is softened
to a sigmoid (as shown in Fig. 2(b)) by introducing a refractory period (chosen using
the same logic as for N3). The onset of the sigmoidal response is chosen to be before
the set-point, allowing the worm to turn a little before it has reached the set-point. This
enables latency reduction and closer tracking of the set-point. Identical reasoning holds
for the N2, N4 and N6 sub-network.


2.2 Random Walk


When the worm is on flat terrain and hence no gradient is detected (ΔC ≈ 0), the
worm explores its surroundings randomly. This random search is initiated by motor
neuron N7. The spiking of N7 causes the worm to move at an increased velocity of
0.3 mm/s and to randomly turn by an angle uniformly distributed in [-22.5°, 22.5°].
This strategy allows for rapid exploration of a local space, with N7 continuing to fire
until a gradient is detected.
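
A minimal sketch of one exploration step is given below, using the speed and turn range stated above. The step duration `dt`, the coordinate convention and the function name are assumptions made for illustration.

```python
import math
import random

SPEED_MM_S = 0.3       # exploration speed while N7 fires (from the text)
MAX_TURN_DEG = 22.5    # turn angle is uniform in [-22.5, 22.5] degrees

def random_walk_step(x, y, heading_deg, dt, rng):
    """Advance the worm by one exploration step of duration dt seconds."""
    heading_deg += rng.uniform(-MAX_TURN_DEG, MAX_TURN_DEG)
    step = SPEED_MM_S * dt
    x += step * math.cos(math.radians(heading_deg))
    y += step * math.sin(math.radians(heading_deg))
    return x, y, heading_deg

rng = random.Random(0)
state = (0.0, 0.0, 0.0)
for _ in range(5):
    state = random_walk_step(*state, dt=1.0, rng=rng)
print(state)
```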


2.3 Escaping Local Extrema: The XOR Sub-network 


When the worm has found and is tracking the set-point, N5 and N6 fire alternately as
the worm keeps swerving left and right. However, if only N5 or only N6 fires exclusively,
then the worm is only turning left or right and hence going around in circles.
Such a scenario is described in Fig. 3. If the worm starts anywhere in the valley, it will
not be able to get out, as every time it moves up towards the rim of the valley (i.e.
dC/dt > 0), N3 fires with continuous firing of N1, as C > Cr at every point. As a
consequence, N5 will fire, making the worm turn back towards the basin. The worm
then moves straight and now climbs up the other side, and this process repeats.
A second case where the worm would again be stuck is the scenario obtained when
Fig. 3 is inverted on its head, i.e. a small peak surrounded by a valley.


[Figure 3 panels, drawn in concentration space. At t = t1: worm above Cr, positive
gradient seen, N5 spikes. At t = t2: clockwise movement due to N5, no more spiking.
At t = t3: worm back to the starting position, starts seeing positive gradient on the
other side. At t = t4: again positive gradient seen, N5 spikes and the worm oscillates
inside the valley.]


Fig. 3. Panels depicting the worm stuck in a valley at four consecutive time steps, in the absence 
of the XOR subnetwork. 


Hence, to solve this problem of getting stuck close to a local extremum, the XOR
sub-network is developed, whose output is N12. N12 is supposed to fire if only N5 or only N6
is found to spike over some time period, i.e. N12 = XOR(N5, N6). When N12 fires, the
worm moves straight for 10 s, without turning, at a velocity of 0.5 mm/s and then
resumes normal operation, having escaped the area where it was stuck. Such behavior
has been observed in biology as well [5].


2.3.1 Design Principles for the XOR Sub-network

To enable the XOR function, only the spiking events at N5 or N6 need to be detected. For
such operation, we introduce a large refractory period of 10 s in the interneurons N8
and N9 and also set a very low voltage threshold for both these neurons, such that a
single spike from N5 is enough to make N8 fire. Once N8 fires, it will remain
unresponsive for 10 s due to the refractory period. N8 hence acts as a timed event detector.
The same description holds for N6 and N9.

If we consider a long enough time period (~10 s) for our control problem, and N8
fires once before and once after this period, without any spike from N9, we infer that
only N5 has been firing for a significant amount of time. Hence, the worm needs to
escape from this region. Similarly, if N9 fired once before and once after a period of
10 s with N8 not firing in between, the worm must escape this area.

We design N10 such that it fires once for every two times that N8 fires. Note that if
N9 fires intermittently in the refractory period of N8, then N10 will not fire due to the
inhibitory connection linking N9 to N10. Interchanging the roles of N8 and N9 yields the
behavior of N11.

It is important to note that the currents injected into N10 and N11 by N8 and N9 decay
at a time scale much faster than the refractory period. Thus we chose very small values
for the membrane conductance of N8 and N9, i.e. these two neurons are not very leaky
and effectively function as integrators over this time-scale. Finally, N12 has a low
spiking threshold and functions as an OR gate. It fires when either N10 or N11 fires, i.e.
N12 = OR(N10, N11). The firing of N12 causes the worm to move straight for 10 s,
without turning. N5, N6 and N7 are inhibited from firing during this 10 s period by
injecting them with a large inhibitory EPSP current with timescale of the order of 10 s.
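
The timed-event logic of this sub-network can be illustrated at the level of spike events. The sketch below is an assumed abstraction, not the authors' simulation: N8 and N9 are modelled as detectors with a 10 s refractory period, N10 and N11 as counters that fire on every second detector event and are reset by the opposite detector (standing in for the inhibitory connection), and N12 as their OR. It does not model membrane dynamics or the 10 s inhibition of N5, N6 and N7 after N12 fires.

```python
def xor_escape_times(n5_spikes, n6_spikes, refractory=10.0):
    """Return the times at which N12 would fire, given motor spike times."""
    events = sorted([(t, "N5") for t in n5_spikes] +
                    [(t, "N6") for t in n6_spikes])
    last_fire = {"N8": None, "N9": None}   # last firing time of each detector
    count = {"N10": 0, "N11": 0}           # detector events seen by each counter
    escapes = []
    for t, src in events:
        if src == "N5":
            det, own, other = "N8", "N10", "N11"
        else:
            det, own, other = "N9", "N11", "N10"
        if last_fire[det] is not None and t - last_fire[det] < refractory:
            continue                        # detector still refractory
        last_fire[det] = t
        count[other] = 0                    # inhibition resets the other counter
        count[own] += 1
        if count[own] == 2:                 # second uninterrupted event
            escapes.append(t)               # N12 = OR(N10, N11) fires
            count[own] = 0
    return escapes

only_n5 = [float(t) for t in range(0, 26)]
print(xor_escape_times(only_n5, []))
alternating_n5 = [float(t) for t in range(0, 26, 2)]
alternating_n6 = [float(t) for t in range(1, 26, 2)]
print(xor_escape_times(alternating_n5, alternating_n6))
```

In the first call only N5 spikes, so an escape is triggered; in the second the two motor neurons alternate, as during normal tracking, and no escape occurs.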


3 Results: Worm Dynamics 


Our simulated worm is placed in a chemotaxis assay of dimensions 10 cm × 10 cm, with
some arbitrary concentration distribution of the chemical NaCl. Figure 4 demonstrates
the AND operation with the concentration seen by the worm and corresponding spiking
patterns for N1, N3 and N5. In Fig. 4, N1 uses a hard threshold to fire for C > Cr (Fig. 4
(b)) and N3 fires for dC/dt > 0 (Fig. 4(c)), which produces an AND behaviour at N5 with
significant latency (Fig. 4(d)). Figure 5 shows the behavior of our simulated worm for
Cr = 54 mM. The worm moves about randomly at first, then follows a gradient until
it reaches the set-point, and then continues to closely track the set-point Cr. We observe
that the worm swerves left and right as it slightly overshoots the tracking concentration,
corrects its course, and this process repeats. The corresponding concentration seen by
the worm, shown in Fig. 6, has an average 0.82% (absolute) deviation from the set-point
(as a fraction of the range of concentration in this space).

In Fig. 7, N1 uses a pre-emptive soft threshold to fire earlier for C > Cr (Fig. 7(b))
and N3 fires for dC/dt > 0 (Fig. 7(c)). This produces an AND behaviour at N5 with reduced
latency (Fig. 7(d)). Figure 9 shows the concentration seen by the worm as it traced the
trajectory in Fig. 8, with an average deviation of 0.55%, which is a 1.5× improvement over
that in Fig. 6 due to the pre-emptive soft threshold. Figure 10 shows a simulated scenario
where the worm gets stuck in a local minimum and is unable to escape. With the XOR
sub-network added to our SNN, it can be seen that the worm can successfully come out of
the concentration valley and starts tracking the set-point, as shown in Fig. 11.


Fig. 4. (a) Concentration vs time. (b) Response of N1 for C > Cr with hard threshold and
(c) N3 for dC/dt > 0, which produces (d) an AND function response at N5 with a significant
latency (red arrow). (Color figure online)


Fig. 7. (a) Concentration vs time. (b) Response of N1 with soft threshold for C > Cr and
(c) N3 for dC/dt > 0, which produces (d) an AND function response at N5 with a reduced
latency (red arrow). (Color figure online)


Fig. 5. Contour tracking with hard thresholding, Cr = 54 mM.


Fig. 10. Worm stuck in a valley (XOR sub-network is disabled).


Fig. 11. The part of the trajectory marked in red is traversed when N12 fires, allowing the
worm to escape and then resume tracking the Cr. (Color figure online)


4 Benchmarking 


Table 1 benchmarks our network with previously reported contour tracking algorithms. 
We achieve state-of-the-art performance, with lower spiking frequencies, making our 
network more energy efficient. Table 2 shows the efficiency of our XOR gate imple- 
mentation in terms of number of neurons used. It also works in an analog fashion 
unlike other reported SNN based gates, which is essential for our network. 


Table 1. Benchmarks for contour tracking algorithm
[Recovered column headers: Max Freq (Hz); External Bias Current; Average Deviation (%).
Recovered row labels: Non-SNN; This Work.]


Table 2. Benchmarks for design of the XOR gate
[Recovered column headers: Model; Input Values; No. of neurons for XOR.
Recovered row labels: Delaney et al. [11]; Wade et al. [14]; Berger et al. [12];
Ferrari et al. [13]; This Work.]


5 Hardware Feasibility 


Hardware realization of such a SNN calls for both the feasibility as well as des- 
ignability of the neuronal response. Recently our group has proposed and experi- 
mentally demonstrated a SOI MOSFET based LIF neuron [2]. The neuronal 
functionality has been achieved by using the SOI transistor’s intrinsic carrier dynamics. 
The response of the SOI neuron shows high sensitivity with MHz order frequency 
range. 

Figure 12a shows the TEM image of the fabricated SOI neuron. Figure 12b
shows the response curves of such an SOI neuron for different refractory periods (τ_ref).

Fig. 12. (a) TEM image of the PD SOI MOSFET fabricated using 32 nm SOI technology [2].
(b) Experimental frequency vs. input curve. Without τ_ref, the response increases sharply with
input, whereas adding τ_ref limits the frequency range. (c) Block diagram demonstrating the
implementation of the refractory period in the SOI neuron. The neuron generates a current output
which is fed to the threshold detector. At threshold, the driver circuit elicits a spike, enabling
the timer circuit. The timer circuit deactivates the neuron during the refractory period. The reset
circuit initializes the neuron. The expected transient output is shown at the output of the three
circuit blocks.


Without any refractory period, the response keeps increasing with input stimuli.
Addition of τ_ref limits the firing rate, and the frequency saturates at a particular value,
like biological neurons. Such a tunable response provides freedom in SNN design for
various applications and also aids the scope of hardware implementation. Figure 12c
shows the block diagram for the implementation of the refractory period in the SOI neuron.
The proposed electronic neuron is highly energy efficient (35 pJ/spike) and consumes
less area (~1700 F² at the 32 nm technology node) compared to state-of-the-art
CMOS neurons.


6 Conclusions 


A complete end-to-end SNN based control circuit is proposed for chemotaxis in C. 
elegans. To implement this, analog rate-coded AND, OR, and XOR logic gates based 
inter-neuronal circuits are proposed. We implemented these gates using a small number 
of neurons, allowing for energy and area efficiency as well as reduced network latency, 
which is crucial for many robotics applications. The network latency was further 
reduced by modifying the response of the threshold detecting sensory neurons. We 
ensure correct operation of the network over arbitrary concentration ranges and choose 
parameters of the network using an analytic approach designed using the rate-coded 
approximation. The neuronal behaviors required to implement the neural logic gates are 
achieved by LIF neurons with configurable refractory periods. State-of-the-art accuracy 
of tracking is demonstrated (<0.6% deviation from set-point). To address the problem
of being stuck around a local extremum en route to a set-point, we designed a novel XOR
based sub-network that presents a biologically relevant solution. Together with AND and
a constant input, XOR forms a functionally complete set of gates, which enables the
implementation of arbitrary logical functions in SNN.
Further, we show hardware implementation of such neurons on an advanced 32 nm SOI
platform.


Acknowledgement. The authors wish to acknowledge Nano Mission & MeitY, Government of 
India, for providing funding for this work. 


Appendix: LIF Model and Ionic Currents 


All the neurons used in our model are LIF neurons with refractory periods [17]. N1, N2,
N3 and N4 have ionic channels that inject input current I_e(t). The specific nature of I_e(t)
is what allows them to function as threshold and gradient detectors. The ionic currents
injected into N1 and N2 respectively are I_e1(t) and I_e2(t), given as:

$$I_{e1}(t) = I_{e,0} \max(0,\, C - C_r - \delta); \qquad I_{e2}(t) = I_{e,0} \max(0,\, C_r + \delta - C) \qquad (2)$$

The parameter δ governs the degree of preemptive response of the threshold detectors. A set
of equations that define I_e3(t) and I_e4(t) was proposed in [4] and used in [3]. We also use
the same equations for the gradient detectors N3 and N4. These equations along with a
detailed explanation can be found in Sect. II of [3].


References 


1. Maass, W.: Networks of spiking neurons: the third generation of neural network models.
Neural Netw. 10(9), 1659-1671 (1997)

2. Dutta, S., et al.: Leaky integrate and fire neuron by charge-discharge dynamics in floating- 
body MOSFET. Sci. Rep. 7, 8257 (2017) 

3. Santurkar, S., Rajendran, B.: C. elegans chemotaxis inspired neuromorphic circuit for 
contour tracking and obstacle avoidance. In: Neural Networks, IJCNN (2015)

4. Appleby, P.A.: A model of chemotaxis and associative learning in C. elegans. Biol. Cybern. 
106(6-7), 373-387 (2012) 

5. Gray, J.M., Hill, J.J., Bargmann, C.I.: A circuit for navigation in Caenorhabditis elegans. 
Proc. Natl. Acad. Sci. U. S. A. 102(9), 3184-3191 (2005) 

6. Galarreta, M., Hestrin, S.: Fast spiking cells and the balance of excitation and inhibition in 
the neocortex. In: Hensch, T.K., Fagiolini, M. (eds.) Excitatory-Inhibitory Balance. Springer, 
Boston (2003). https://doi.org/10.1007/978-1-4615-0039-1_11 

7. Kato, S., et al.: Temporal responses of C. elegans chemosensory neurons are preserved in 
behavioral dynamics. Neuron 81(3), 616-628 (2014) 

8. Liu, Q., Hollopeter, G., Jorgensen, E.M.: Graded synaptic transmission at the Caenorhabditis 
elegans neuromuscular junction. Proc. Natl. Acad. Sci. U. S. A. 106, 10823-10828 (2009) 

9. Goldental, A., et al.: A computational paradigm for dynamic logic-gates in neuronal activity. 
Front. Comput. Neurosci. 8, 52 (2014) 

10. Yang, J., Yang, W., Wu, W.: A novel spiking perceptron that can solve XOR problem. 
ICS AS CR (2011) 

11. Reljan-Delaney, M., Wall, J.: Solving the linearly inseparable XOR problem with spiking 
neural networks. https://doi.org/10.1109/sai.2017.8252173 


12. Berger, D.L., de Arcangelis, L., Herrmann, H.J.: Learning by localized plastic adaptation in
recurrent neural networks (2016)

13. Ferrari, S., et al.: Biologically realizable reward-modulated Hebbian training for spiking
neural networks. In: Neural Networks, IJCNN (2008)

14. Wade, J., et al.: A biologically inspired training algorithm for spiking neural networks.
Dissertation. University of Ulster (2010)

15. Kunitomo, H., et al.: Concentration memory-dependent synaptic plasticity of a taste circuit
regulates salt concentration chemotaxis in Caenorhabditis elegans. Nat. Commun. 4, 2210
(2013)

16. Suzuki, H., et al.: Functional asymmetry in Caenorhabditis elegans taste neurons and its
computational role in chemotaxis. Nature 454(7200), 114 (2008)

17. Naud, R., Gerstner, W.: The performance (and limits) of simple neuron models:
generalizations of the leaky integrate-and-fire model. In: Le Novère, N. (ed.) Computational
Systems Neurobiology. Springer, Dordrecht (2012). https://doi.org/10.1007/978-94-007-3858-4_6



Gating Sensory Noise in a Spiking 
Subtractive LSTM 


Isabella Pozzi, Roeland Nusselder, Davide Zambrano, and Sander Bohté


Centrum Wiskunde & Informatica, Amsterdam, The Netherlands 
isabella.pozzi@cwi.nl 


Abstract. Spiking neural networks are being investigated both as bio- 
logically plausible models of neural computation and also as a potentially 
more efficient type of neural network. Recurrent neural networks in the 
form of networks of gating memory cells have been central in state-of- 
the-art solutions in problem domains that involve sequence recognition or 
generation. Here, we design an analog Long Short-Term Memory (LSTM) 
cell where its neurons can be substituted with efficient spiking neurons, 
where we use subtractive gating (following the subLSTM in [1]) instead of 
multiplicative gating. Subtractive gating allows for a less sensitive gating 
mechanism, critical when using spiking neurons. By using fast adapting
spiking neurons with a smoothed Rectified Linear Unit (ReLU)-like effective
activation function, we show that an accurate conversion from
an analog subLSTM to a continuous-time spiking subLSTM is possible.
This architecture results in memory networks that compute very effi- 
ciently, with low average firing rates comparable to those in biological 
neurons, while operating in continuous time. 


Keywords: Spiking neurons · LSTM · Recurrent neural networks ·
Supervised learning · Reinforcement learning


1 Introduction 


With the manifold success of biologically inspired deep neural networks, networks 
of spiking neurons are being investigated as potential models for computational 
and energy efficiency. Spiking neural networks mimic the pulse-based communi- 
cation in biological neurons: in brains, neurons spike only sparingly — on average 
1-5 spikes per second [2]. A number of successful convolutional neural networks 
based on spiking neurons have been reported [3-7], with varying degrees of bio- 
logical plausibility and efficiency. Still, while spiking neural networks have thus 
been applied successfully to solve image-recognition tasks, many deep learning 
algorithms use recurrent neural networks (RNNs), especially variants of Long 
Short-Term Memory (LSTM) layers [8] to implement dynamic kinds of memory. 
Compared to convolutional neural networks, LSTMs use memory cells to store 
select information and various gates to direct the flow of information in and out 
of the memory cells. The state-changes in such networks are iterative and lack an
intrinsic notion of continuous time. To translate LSTM-like networks into spiking
networks, such a notion of time has to be included. At present, the only spike-based
version of LSTM has been realized for the IBM TrueNorth platform [9]: this work
proposes an approximate LSTM specifically for TrueNorth’s constraints by using
a store-and-release mechanism synchronized across its modules, which is effectively still
an iterative and synchronized model of computation. Intel recently introduced the
first semi-commercial spike-based hardware [10], underlining the need for efficient
and effective spiking neural network algorithms. Here, we propose a biologically
plausible spiking LSTM network based on an asynchronous approach. While a
continuous time model in LSTMs can be implemented by taking small, finite
time-steps, a key problem in spiking LSTM models is the multiplicative nature
of the gating mechanism: such gating requires a graded response from spiking
neurons to create a gradient for learning the proper degree of gating. We found
that multiplicative gating also needs to be precise, in that noisy gating signals
disturbed the learning of memory tasks. We exploit subtractive gating, the “sub- 
LSTM” [1], to use spiking neurons that effectively compute a fast ReLU function, 
enabling a spiking subLSTM network to operate in continuous time. We con- 
struct a spiking subLSTM network and successfully demonstrate the efficacy of 
this approach on two standard machine learning tasks: we show that it is indeed 
possible to use standard analog neurons for the training phase of the modified 
subLSTM and accurately convert the networks into spiking versions, such that 
during inference phase spike-based computation is sparse (comparable to active 
biological neurons) and efficient. 


2 Model 


To construct a spiking subLSTM network, we first describe the Adaptive Spik- 
ing Neurons we aim to use, and we show how we can approximate their effective 
corresponding activation function. We then show how an LSTM network com- 
prised of a spiking memory cell and a spike-driven input-gate can be constructed 
and we discuss how analog versions of this subLSTM network are trained and 
converted to spiking networks. 


Adaptive Spiking Neuron. The requirements of the network architectures 
guide us in the demands put on spiking neuron models. Here, we use Adaptive 
Spiking Neurons (ASNs) as described in [11]. ASNs are a variant of an adapting 
Leaky Integrate & Fire (LIF) neuron model that includes fast adaptation to the 
dynamic range of input signals. The behavior of the ASN is determined by the 
following equations: 


incoming postsynaptic current: $I(t) = \sum_i \sum_{t_s^i} w_i \vartheta(t_s^i) \exp\left(\frac{t_s^i - t}{\tau_\beta}\right)$, (1)

input signal: $S(t) = (\phi * I)(t)$, (2)

threshold: $\vartheta(t) = \vartheta_0 + \sum_{t_s} m_f \vartheta(t_s) \exp\left(\frac{t_s - t}{\tau_\gamma}\right)$, (3)

internal state: $\hat{S}(t) = \sum_{t_s} \vartheta(t_s) \exp\left(\frac{t_s - t}{\tau_\eta}\right)$, (4)


where $w_i$ is the weight (synaptic strength) of the neuron’s incoming connection;
$t_s^i < t$ denote the spike times of neuron i, and $t_s < t$ denote the spike times
of the neuron itself; $\phi(t)$ is an exponential smoothing filter with a short time
constant $\tau_\phi$; $\vartheta_0$ is the resting threshold; $m_f$ is a variable controlling the speed of
spike-rate adaptation; $\tau_\beta$, $\tau_\gamma$, $\tau_\eta$ are the time constants that determine the rate
of decay of $I(t)$, $\vartheta(t)$ and $\hat{S}(t)$ respectively. The ASN emits spikes following a
firing condition defined as $S(t) - \hat{S}(t) > \vartheta(t)/2$ and, instead of sending binary
spikes, the ASNs here communicate with “analog” spikes of which the height is
equal to the value of the threshold at the time of firing; note that this model
speculatively implies a tight coupling between spike-triggered adaptation and
short-term synaptic plasticity (see [12] and [11] for more details).


Activation Function of the Adaptive Analog Neuron. In order to create 
a network of ASNs that performs correctly on typical LSTM tasks, our approach 
is to train a network of Adaptive Analog Neurons (AANs) and then convert the 
resulting analog network into a spiking one, similar to [5,6,11]. We define the 
activation function of the AANs as the function that maps the input signal S 
to the average PSC I that is perceived by the next (receiving) ASN. We then fit 
the normalized spiking activation function with a softplus-shaped function as: 


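$$\mathrm{AAN}(S) = a \cdot \log\left(1 + b \cdot \exp(c \cdot S)\right), \qquad (5)$$

with derivative:

$$\frac{d\,\mathrm{AAN}(S)}{dS} = \frac{a \cdot b \cdot c \cdot \exp(c \cdot S)}{1 + b \cdot \exp(c \cdot S)}, \qquad (6)$$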


Fig. 1. Left panel: average output signal of the ASN as a function of its incoming PSC
I, where the error bars indicate the standard deviation of the spiking simulation, and
the corresponding AAN curve. The shape of the ASN curve is well described by the
AAN activation function, Eq. 5; right panel: the output signal of the ASN alone.

Fig. 2. Overview of the construction of an Adaptive Analog subLSTM and an Adaptive
Spiking subLSTM cell. This compares to a subLSTM with only an input gate.


where, for the neuronal parameters used, we find a = 0.04023, b = 1.636 and 
c = 23.54. Using this mapping from the AAN to the ASN (see Fig. 1), the 
activation function can be used during training of the network with analog AANs: 
thereafter, the ASNs are used as “drop in” replacements for the AANs. The ASNs 
use τ_η = τ_β = τ_γ = 10 ms, and ϑ_0 and m_f are set to 0.3 and 0.18 for all neurons.
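
The fitted activation function of Eq. 5 and its derivative in Eq. 6 can be written down directly. The sketch below uses the fitted constants a, b and c reported above; the function names are illustrative, and the finite-difference comparison is only a sanity check added here.

```python
import math

# Fitted constants reported in the text for the neuronal parameters used.
A, B, C = 0.04023, 1.636, 23.54

def aan(s):
    """Softplus-shaped AAN activation, Eq. 5 (overflows for very large s)."""
    return A * math.log(1.0 + B * math.exp(C * s))

def aan_grad(s):
    """Analytic derivative of the AAN activation, Eq. 6."""
    e = B * math.exp(C * s)
    return A * C * e / (1.0 + e)

# Sanity check: compare the analytic slope with a central finite difference.
h = 1e-6
numeric = (aan(0.1 + h) - aan(0.1 - h)) / (2 * h)
print(aan(0.1), aan_grad(0.1), numeric)
```

For large inputs the slope approaches a·c, so the function behaves like a ReLU with a smoothed transition around zero.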


Adaptive Spiking subLSTM. An LSTM cell usually consists of an input and 
output gate, an input and output cell and a CEC [8]. Deviating from the origi- 
nal formulation and more recent versions where forget gates and peepholes were 
added [13], the LSTM architecture as we present it here only consists of a
(subtractive) input gate, input and output cells, and a CEC. Moreover, in the original
formulation, an LSTM unit uses a sigmoidal activation function in the input gate
and input cell. However, when using spiking neurons, this causes inaccuracies 
between the analog and spiking network, as, due to the variance in the spike- 
based approximation, the gates are never completely closed nor completely open. 
In a recently proposed variation from the original LSTM architecture, called 
subLSTM [1], the typical multiplicative gating mechanism is substituted with a 
subtractive one, not requiring thus for the gates to output values exclusively in 
the range [0, 1]. This allows us to use neurons characterized by a smoothed ReLU 
as activation function. Mathematically, the difference between the integration in 
the CEC in the LSTM and subLSTM is given as: 


$$\text{LSTM: } c_t = c_{t-1} + z_t \odot i_t, \qquad \text{subLSTM: } c_t = c_{t-1} + z_t - i_t, \qquad (7)$$

with $c_t$ the value of the memory cell at time t, and $z_t$ and $i_t$ the signals coming
from the input cell and the input gate, respectively.
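
The two update rules in Eq. 7 differ only in how the gate signal enters the memory cell. A minimal scalar sketch is shown below; the function names and the numeric inputs are illustrative assumptions.

```python
def lstm_cec_step(c_prev, z, i):
    """Multiplicative gating: the input-cell signal is scaled by the gate."""
    return c_prev + z * i

def sublstm_cec_step(c_prev, z, i):
    """Subtractive gating: the gate signal is subtracted from the input."""
    return c_prev + z - i

c_prev, z, i = 1.0, 0.5, 0.2
print(lstm_cec_step(c_prev, z, i))
print(sublstm_cec_step(c_prev, z, i))
```

In the subtractive form the gate output is not required to lie in [0, 1], which is what allows a ReLU-like activation to be used for the gate.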

As noted, to obtain a working Adaptive Spiking subLSTM, we first train its 
analog equivalent, the Adaptive Analog subLSTM. Figure 2 shows the schematic 
of the Adaptive Analog subLSTM and its spiking analogue: we aim for a one-on- 
one mapping from the Adaptive Analog subLSTM to the Adaptive Spiking sub- 
LSTM. This means that while we train the Adaptive Analog subLSTM network 
with the standard time step representation, the conversion to the continuous- 
time spiking domain is achieved by presenting each input for a time window of 
size At, which is determined by the neuronal parameters and by the size of the 
network. We find that by simply multiplying the signal incoming to the spiking
CEC by a conversion factor (i.e. Cr in Fig. 2), the two architectures process
inputs identically, even if the time component is treated differently.


Spiking Input Gate and Spiking Input Cell. The AAN functions are used 
in the Adaptive Analog LSTM cell for the input gate, input cell and output cell. 
From the activation value of the input cell the activation value of the input gate is 
subtracted, before it enters the CEC, see Fig. 2. Correspondingly, in the spiking 
version of the input gate, the outgoing signal is subtracted from the spikes that 
move from the ASN of the input cell to the ASN of the output cell. This leads to 
a direct mapping from the Adaptive Analog subLSTM to the Adaptive Spiking 
subLSTM. 


Spiking Constant Error Carousel (CEC) and Spiking Output Cell.
The Constant Error Carousel (CEC) is the central part of the LSTM cell and
avoids the vanishing gradient problem [8]. In the Adaptive Spiking subLSTM,
we merge the CEC and the output cell into one ASN with an internal state that
does not decay; in the brain this could be implemented by slowly decaying (seconds)
neurons [14]. The value of the CEC in the Adaptive Analog LSTM corresponds
with state I of the ASN output cell in the Adaptive Spiking LSTM. In the
Adaptive Spiking subLSTM, we set τ_β in Eq. 1 to a very large value for the
CEC cell to obtain the integrating behavior of a CEC. Since no forget gate is
implemented, this results in a spiking CEC neuron that fully integrates its input.
When τ_β is set to ∞, every incoming spike is added to a non-decaying PSC I. So if
the state of the sending neuron (ASN_in in Fig. 3) has a stable inter-spike interval
(ISI), then I of the receiving neuron (ASN_out) is increased with incoming spike
height h every ISI, so h/ISI per time step. The same integrating behavior needs to
be translated to the analog CEC. Since the CEC cell of the Adaptive Spiking
subLSTM integrates its input S every time step by S/τ_η, we can map this to the
CEC of the Adaptive Analog subLSTM. The CEC of a traditional LSTM without
a forget gate is updated every time step by CEC(t) = CEC(t − 1) + S, with S
its input value (i.e. z_t − i_t for a subtractive LSTM). The CEC of the Adaptive
Analog subLSTM is updated every time step by CEC(t) = CEC(t − 1) + S/τ_η. This
is depicted in Fig. 2 via a weight after the input gate with value 1/τ_η. To allow
a correct continuous-time representation after the spike-coding conversion, we
divide the incoming connection weight to the CEC, W_CEC, by the time window
Δt. In our approach then, we train the Adaptive Analog subLSTM as for the
traditional LSTM (without the τ_η factor), which effectively corresponds to setting
a continuous-time time window Δt = τ_η. Thus, to select a different Δt, in the
spiking version W_CEC has to be set to W_CEC = τ_η/Δt. The middle plot in Fig. 3
shows that setting τ_β to ∞ for ASN_out in a spiking network results in the same
behavior as using an analog CEC that integrates with CEC(t) = CEC(t − 1) + S/τ_η,
since the slope of the analog CEC is indeed the same as the slope of the spiking
CEC. Here, every time step in the analog experiment corresponds to Δt = 40 ms.

Fig. 3. A simulation to illustrate how the analog CEC integrates its input signal with
the same speed as an ASN with τ_β = ∞ provided that the input signal does not change
and that 1 analog time step corresponds to Δt = 40 ms (middle). In the right panel,
the spiking output signal approximates the analog output.
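
The weight scaling W_CEC = τ_η/Δt can be illustrated with an idealised calculation. The sketch below is not the authors' simulation: it assumes a perfectly regular incoming spike train of constant height h whose rate satisfies h/ISI = S/τ_η, ignores threshold adaptation, and compares the accumulated spiking PSC with an analog CEC that adds S once per time step. All function names and the numeric values of S and h are assumptions.

```python
import math

def analog_cec(s, n_steps):
    """Analog CEC trained without the tau_eta factor: adds S every step."""
    cec = 0.0
    for _ in range(n_steps):
        cec += s
    return cec

def spiking_cec(s, n_steps, dt_ms=40.0, tau_eta_ms=10.0, h=0.3):
    """Non-decaying PSC driven by an idealised regular spike train (s > 0)."""
    isi_ms = h * tau_eta_ms / s          # stable ISI implied by h/ISI = S/tau_eta
    w_cec = tau_eta_ms / dt_ms           # weight scaling from the text
    n_spikes = math.floor(n_steps * dt_ms / isi_ms)
    return w_cec * h * n_spikes          # each spike adds w_cec * h

print(analog_cec(0.5, 20), spiking_cec(0.5, 20))
```

Under these assumptions the two values differ by at most one weighted spike (W_CEC · h), which reflects the matching slopes described for the middle panel of Fig. 3.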


Learning Rule. To train the analog subLSTMs on the supervised tasks, a 
customized truncated version of real-time recurrent learning (RTRL) was used. 
This is the same algorithm used in [13], where the partial derivatives w.r.t. the 
weights Wx. and Wx; (see Fig. 2) are truncated. For the reinforcement learning 
(RL) tasks we used RL-LSTM [15], which uses the same customized, truncated 
version of RTRL that was used for the supervised tasks. RL-LSTM also incor- 
porates eligibility traces to improve training and Advantage Learning [16]. All 
regular neurons in the network are trained with traditional backpropagation. 


3 Experiments 


Since the presented Adaptive Analog subLSTM only has an input gate and no 
output or forget gate, we present four classical tasks from the LSTM literature 
that do not rely on these additional gates. 


Sequence Prediction with Long Time Lags. The main concept of LSTM, 
the ability of a CEC to maintain information over long stretches of time, was 
demonstrated in [8] in a Sequence Prediction task: the network has to predict
the next input of a sequence of p + 1 possible input symbols denoted as
a_1, ..., a_{p−1}, a_p = x, a_{p+1} = y. In the noise-free version of this task, every symbol
is represented by the p + 1 input units with the i-th unit set to 1 and all
the others to 0. At every time step a new input of the sequence is presented. As
in the original formulation, we train the network with two possible sequences,
(x, a_1, a_2, ..., a_{p−1}, x) and (y, a_1, a_2, ..., a_{p−1}, y), chosen with equal probability.
For both sequences the network has to store a representation of the first element 
in the memory cell for the entire length of the sequence (p). We train 50 networks 
on this task for a total of 200k trials, with p = 100, on an architecture with p+1 
input units and p+ 1 output units. The input units are fully connected to the 
output units without a hidden layer. The same sequential network construction 
method from the original paper was used to prevent the “abuse problem”: the 
Adaptive Analog subLSTM cell is only included in the network after the error 
stops decreasing [8]. In the noisy version of the sequence prediction task, the 
network still has to predict the next input of the sequence, but the symbols 
from $a_1$ to $a_{p-1}$ are presented in random order and the same symbol can occur 
Table 1. Summary of the results. The number of iterations necessary for the network 
to learn is shown both for the original [8,15] and current implementation. Successfully 
trained networks (%), ASN accuracy (%) over the number of successfully trained 
networks, total number of spikes per task and average firing rate (Hz) are also reported. 


Task                  | Orig. conv. (%) | AAN conv. (%) | ASN (%) | Nspikes (Hz) 
Seq. prediction       | 5040 (100)      | 4562 (100)    | 100     | 2578 ± 18 (129) 
Noisy seq. prediction | 5680 (100)      | 64428 (100)   | 100     | 2241 ± 22 (112) 
T-Maze                | M (100)         | 15633 (86)    | 97      | 1901 ± 249 (77) 
Noisy T-Maze          | 1.75M (100)     | 20440 (94)    | 92      | 1604 ± 216 (65) 


multiple times. Therefore, only the final symbols $a_p$ and $a_{p+1}$ can be correctly 
predicted. This version of the sequence prediction task avoids the possibility that 
the network learns local regularities in the input stream. We train 50 networks 
with the same architecture and parameters as in the previous task, for 200k trials. 
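
The two task variants described above can be sketched as a small data generator. This is a minimal illustration, not the authors' code: the function names, the index convention (indices 0..p-2 for $a_1, \ldots, a_{p-1}$, then x and y), and the choice to draw the noisy distractors with replacement are assumptions made for this sketch.

```python
import random

def make_sequence(p, noisy=False, rng=random):
    """Return one training sequence as a list of symbol indices.

    Assumed convention: indices 0..p-2 stand for a_1..a_{p-1},
    index p-1 stands for x (= a_p) and index p for y (= a_{p+1}).
    """
    x, y = p - 1, p
    first = rng.choice([x, y])            # both sequences are equally likely
    if noisy:
        # Distractors drawn with replacement, so a symbol may repeat (assumption).
        middle = [rng.randrange(p - 1) for _ in range(p - 1)]
    else:
        middle = list(range(p - 1))       # a_1, ..., a_{p-1} in fixed order
    return [first] + middle + [first]

def one_hot(index, size):
    """Local coding: unit `index` is set to 1, all other units to 0."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def input_target_pairs(seq, p):
    """Pair each input vector with the index of the next symbol to predict."""
    return [(one_hot(s, p + 1), t) for s, t in zip(seq[:-1], seq[1:])]
```

With p = 100, each sequence has 101 symbols and yields 100 input/target pairs; only the last target depends on the first symbol, which is what forces the memory cell to hold it for the whole sequence.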


T-Maze Task. In order to demonstrate the generality of our approach, we 
trained a network with Adaptive Analog subLSTM cells on a Reinforcement 
Learning task, originally introduced in [15]. In the T-Maze task, an agent has 
to move inside a maze to reach a target position in order to be rewarded while 
maintaining information during the trial. The maze is composed of a long cor- 
ridor with a T-junction at the end, where the agent has to make a choice based 
on information presented at the start of the task. The agent receives a reward 
of 4 if it reaches the target position and −0.2 if it moves against the wall. If it 
moves in the wrong direction at the T-junction it also receives a reward of −0.2 
and the system is reset. The agent has 3 inputs and 4 outputs corresponding to 
the 4 possible directions it can move to. At the beginning of the task the input 
can be either 011 or 110 (which indicates on which side of the T-junction the 
reward is placed). Here, we chose the corridor length N = 20. A noiseless and 
a noisy version of the task were defined: in the noiseless version the corridor is 
represented as 101, and at the T-junction 010; in a noisy version the input in 
the corridor is represented as a0b where a and b are two uniformly distributed 
random variables in a range of [0,1]. While the noiseless version can be learned 
by LSTM-like networks without input gating [17], the noisy version requires the 
use of such gates. The network consists of a fully connected hidden layer with 12 
AAN units and 3 Adaptive Analog subLSTMs. The same training parameters 
are used as in [15]; we train 50 networks for each task and all networks have the 
same architecture. As a convergence criterion, we checked whether the network 
reached on average a total reward greater than 3.5 in the last 100 trials. 
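
The input coding and the convergence criterion of the T-Maze task can be sketched as follows. This is an illustration under stated assumptions, not the original environment: the function names are hypothetical, the mapping of the start codes 011 and 110 to "left" and "right" is arbitrary, and the exact number of corridor cells between start and junction for N = 20 is assumed.

```python
import random

def tmaze_observations(n=20, noisy=False, rng=random):
    """Observations seen while walking straight down the corridor.

    Returns (goal_side, observations). Which start code means 'left'
    is an assumption; the text only states that the code marks the side.
    """
    goal_side = rng.choice(["left", "right"])
    start = (0, 1, 1) if goal_side == "left" else (1, 1, 0)
    corridor = []
    for _ in range(n - 1):
        if noisy:
            # Noisy corridor code a0b with a, b uniform in [0, 1).
            corridor.append((rng.random(), 0.0, rng.random()))
        else:
            corridor.append((1, 0, 1))
    junction = (0, 1, 0)
    return goal_side, [start] + corridor + [junction]

def has_converged(total_rewards, window=100, threshold=3.5):
    """True once the mean total reward over the last `window` trials exceeds the threshold."""
    if len(total_rewards) < window:
        return False
    recent = total_rewards[-window:]
    return sum(recent) / window > threshold
```

In the noisy variant the corridor observations carry no information about the goal side, so the agent can only succeed by protecting the start code from being overwritten, which is the role of the input gate.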


4 Results 


As shown in Table 1, for the noise-free and noisy Sequence Prediction tasks all of 
the networks were both successfully trained and could be converted into spiking 
Fig. 4. Top panels: output values of the analog (left) and spiking (right) network for 
the noise-free sequence prediction task. Only the last 5 input symbols of the series 
are shown. The last symbol y (black) is correctly predicted both in the last time step 
(analog) and in the last 40 ms (spiking). Bottom panels: Q-values of the analog (left) 
and spiking (right) network for the noisy T-Maze task. At the last time step/40 ms it 
correctly selects the right action (solid gray line). 
Fig. 5. The values of the analog CECs and spiking CECs for the noise-free sequence 
prediction (left panel) and noisy T-Maze (right panel) tasks. The spiking CEC is the 
internal state S of the output cell of the Adaptive Spiking LSTM. 
networks. The top panels in Fig. 4 show the last 5 inputs of a noise-free Sequence 
Prediction task before (left) and after (right) the conversion, demonstrating the 
correct predictions made in both cases. In the noisy task, all the successfully 
trained networks were also still working after the conversion. Finally, we found 
that the number of trials needed to reach the convergence criterion was, on 
average, lower than that reported in [8] for the noiseless task, but much 
higher for the noisy task. Both training and conversion proved harder 
for the T-Maze task, with a few networks not converting correctly into spiking networks. 
The bottom panels in Fig. 4 show the Q-values of a noisy T-Maze task, demonstrating 
the correspondence between the analog and spiking representation even 
in the presence of noisy inputs. In general, we see that the spiking CEC value is 
close to the analog CEC value, while always exhibiting some deviations. Table 1 
also reports the average firing rate per neuron, showing reasonably low values 
compatible with those recorded from real (active) neurons. 


5 Discussion 


Gating is a crucial ingredient in recurrent neural networks that are able to learn 
long-range dependencies [8,18]. Input gates in particular allow memory cells to 
maintain information over long stretches of time regardless of the presented 
(irrelevant) sensory input [8]. The ability to recognize and maintain information 
for later use is also what makes gated RNNs like LSTM so successful in a 
great many sequence-related problems, ranging from natural language processing 
to learning cognitive tasks [15]. To transfer deep neural networks to networks of 
spiking neurons, a highly effective method has been to map the transfer function 
of spiking neurons to analog counterparts and then, once the network has been 
trained, substitute the analog neurons with spiking neurons [5,6,11]. Here, we 
showed how this approach can be extended to gated memory units, and we 
demonstrated this for a subLSTM network comprised of an input gate and a 
CEC. Hence, we effectively obtained a low-firing rate asynchronous subLSTM 
network which was then shown to be suitable for learning sequence prediction 
tasks, both in a noise-free and noisy setting, and a standard working memory 
reinforcement learning task. The learned network could then successfully be 
mapped to its spiking neural network equivalent for the majority of the trained 
analog networks. Further experiments will be needed in order to implement other 
gates and recurrent connections from the output cell of the subLSTM. Although 
the adaptive spiking LSTM implemented in this paper does not have output 
gates [8], they can be included by following the same approach used for the 
input gates: a modulation of the synaptic strength. The reasons for our approach 
are multiple: first of all, most of the tasks do not really require output gates; 
moreover, modulating each output synapse independently is less intuitive and 
biologically plausible than for the input gates. A similar argument can be made 
for the forget gates, which were not included in the original LSTM formulation: 
here, the solution consists in modulating the decaying factor of the CEC. It must 
be mentioned that which gates are really needed in an LSTM network is still an 
open question, with answers depending on the kind of task to be solved [19, 20]. 
Abstract. This paper proposes to apply spiking signals to the control 
of an AC motor drive at variable speed in real-time experimentation. 
Innovative theoretical concepts of spiking signal processing (SSP, [1]) 
are introduced using the $I_{Na,p} + I_K$ neuron model [7]. Based on SSP 
concepts, we designed a spiking speed controller inspired by human 
movement control. The spiking speed controller is then integrated in 
the field oriented control (FOC, [13]) topology in order to control an 
induction drive at various mechanical speeds. Experimental results are 
presented and discussed. This paper demonstrates that spiking signals 
can be straightforwardly used for electrical engineering applications in 
real time experimentation based on robust SSP theory. 
1 Spiking Signal Processing 


1.1 Spiking Transformation 


In a previous paper [1], we assumed that a continuous signal x(t) (black curve 
in Fig. 1) could be mathematically transformed by a single neuron into a series 
of spikes $\iota(t - t_n)$ (Greek letter iota) called $x_\iota(t)$ (blue spikes in Fig. 1) and 
representing the exact image of the original continuous signal x(t). 


\[ x_\iota(t) = \sum_{n=0}^{+\infty} \iota(t - t_n) \tag{1} \]

\[ x_\iota(t) = x(t) \tag{2} \]


To ensure equivalence between both signals in (2), we have to set the firing 
frequency $\nu_x(t)$ (Greek letter nu) of the neuron as the image of the continuous 
signal x(t). 

\[ \nu_x(t) = x(t) \tag{3} \]


© Springer Nature Switzerland AG 2018 

The firing rate or firing frequency $\nu_x[n]$ is defined as the frequency between 
two consecutive spikes fired at $t_n$ and $t_{n-1}$, with $\Delta t_x[n]$ being the elapsed time 
(Fig. 1). Equation (4) represents the firing rate in a discrete form since a fired spike 
can be identified by its position $t_n$. The next upcoming firing rate is represented 
in a continuous form (5) since time runs until the next potential upcoming spike. 


\[ \nu_x[n] = 1/\Delta t_x[n] = 1/(t_n - t_{n-1}) \tag{4} \]


\[ \nu_x(t) = 1/\Delta t_x(t) = 1/(t - t_{n-1}) \tag{5} \]


By (4) and (5), we can express the series $x_\iota(t)$ according to its spiking 
elapsed time $\Delta t$ (7) or firing frequency $\nu$ (8). The equations below are different 
ways to express the same spiking series $x_\iota$ (blue spikes in Fig. 1). 


\[ x_\iota(t) = \sum_{n=0}^{+\infty} \iota\bigl(t - (t_{n-1} + \Delta t_x[n])\bigr) \tag{6} \]


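Taking into account the assumption (3), we can express the spiking series 
$x_\iota(t)$ as 


\[ x_\iota(t) = \sum_{n=0}^{+\infty} \iota\bigl(x(t) - 1/\Delta t_x(t)\bigr) \tag{9} \]

\[ \phantom{x_\iota(t)} = \sum_{n=0}^{+\infty} \iota\bigl(x(t) - \nu_x(t)\bigr) \tag{10} \]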

Despite their different appearance, we will see in real-time experimentation that 
the spiking signal $x_\iota(t)$ (blue spikes in Fig. 1) is an approximation of the original 
continuous signal x(t) (black curve in Fig. 1). 


1.2 Accuracy 


In Fig. 1, we see that the firing rate decomposition $\nu_x(t)$ of the spiking series 
$x_\iota(t)$ (blue spikes in Fig. 1) better approximates the signal x(t) at higher signal 
amplitude. Low amplitude of x(t) induces a low firing rate $\nu_x(t)$ and a long 
$\Delta t_x(t)$ sampling step with a poor approximation accuracy. In order to increase 
the accuracy, we set the constant a in (12) to artificially increase the firing 
frequency $\nu_x(t)$. We call a the sensitivity of the neuron. When the parameter a 
increases, the accuracy increases. 

In order to conserve the energy equality between signals x(t) and x,(t), it is 
necessary to decrease the energy of the spike by the same factor a (red spikes in 
Fig. 1). The spiking series ultimately equals: 
\[ x_\iota(t) = \frac{1}{a} \sum_{n=0}^{+\infty} \iota\bigl(x(t) - 1/(a\,\Delta t_x(t))\bigr) \tag{11} \]

\[ \phantom{x_\iota(t)} = \frac{1}{a} \sum_{n=0}^{+\infty} \iota\bigl(x(t) - \nu_x(t)/a\bigr) \tag{12} \]


Fig. 1. Spiking sampling of x(t) with a = 1 (in blue) and a = 3 (in red). (Color figure 
online) 


Figure 1 shows the firing rate ν accuracy with a sensitivity parameter a = 1 
(in blue) and a = 3 (in red). The accuracy of the red curve is enhanced compared 
to the blue one. 
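
The effect of the sensitivity a can be illustrated numerically. The sketch below is one possible reading of Eqs. (5) and (12), not the authors' implementation: it assumes that a spike is emitted as soon as the running rate 1/(t − t_prev) has fallen to a·x(t), and that the signal is recovered from the inter-spike interval divided by a. The function names and the time step dt are hypothetical.

```python
def encode(x, a=1.0, t_end=1.0, dt=1e-4):
    """Return spike times for a positive signal x(t) and neuron sensitivity a.

    Assumed firing condition: a spike is emitted when the running rate
    1/(t - t_prev) has dropped to a * x(t).
    """
    spikes, t_prev = [], 0.0
    n_steps = int(round(t_end / dt))
    for k in range(1, n_steps + 1):
        t = k * dt
        target = a * x(t)
        if target > 0 and 1.0 / (t - t_prev) <= target:
            spikes.append(t)
            t_prev = t
    return spikes

def decode(spikes, a=1.0):
    """Estimate x at each spike from the inter-spike interval: nu[n] / a."""
    return [1.0 / (a * (t1 - t0)) for t0, t1 in zip(spikes, spikes[1:])]

def constant_signal(t):
    return 5.0

coarse = encode(constant_signal, a=1.0)   # roughly one spike every 0.2 s
fine = encode(constant_signal, a=3.0)     # roughly three times as many spikes
```

For a constant signal both settings decode to about the same value, but the higher sensitivity samples the signal about three times as often; weighting each spike by 1/a then keeps the summed spike amplitude over a time window comparable, which is the energy argument made above.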

The accuracy principle is equivalent to the movement control principle found 
in the human body. Small, fast, and accurate movement requires fast muscle 
fibers excited by motoneurons with small spike amplitude, while high-amplitude 
movement requires slow muscle fibers excited by other types of neurons, giving 
less accuracy in the movement. This movement control principle will be applied 
in the next section in real-time experimentation. 


2 Spiking Signals in FOC Control Drive Experimentation 


In this section, we apply the theoretical basis of SSP to design a spiking speed controller 
for real-time AC control drive experimentation. After presenting the FOC 
(Field Oriented Control, [8-13]) control strategy and the experimental setup, the 
classical PI speed controller is replaced by a spiking speed controller inspired by 
human movement control. Experimental results are presented and discussed 
for different control parameters. 
2.1 Space Vector Modulation (SVM) - Field Oriented Control (FOC) 


The principle of the FOC strategy is to decouple the control of flux 
and electromagnetic torque, as a DC machine does, but without the drawback 
of high maintenance costs due to the wear of commutators and brushes. A 
coordinate frame (d, q) aligned with and fixed to the rotor flux allows such decoupled 
torque and magnetization control. 

Figure 2 presents the FOC control structure: 


— the 3-phase stator current measurements ($i_{sa}$, $i_{sb}$, $i_{sc}$) are transformed into (d, q) 
components ($i_{sd}$, $i_{sq}$) through Clarke's and Park's transformations 

— the reference rotor flux component ($i_{sd,ref}$) is kept constant and the reference 
electromagnetic torque component ($i_{sq,ref}$) is generated by a PI speed 
controller 

— the (d, q) stator current components ($i_{sd}$, $i_{sq}$) are controlled through PI current 
controllers generating the adequate (d, q) stator voltage ($u_{sd}$, $u_{sq}$) to 
apply 

— after Park's inverse matrix transformation, the stator voltage components 
($u_{sd}$, $u_{sq}$) produce the stator voltage vector to apply to the motor. 
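
The coordinate changes used in the list above can be sketched with the textbook forms of the transformations. This is a generic illustration, not the code running on the drive: the amplitude-invariant scaling of the Clarke transform and the function names are assumptions, and the rotor flux angle `theta` is taken as given.

```python
import math

def clarke(i_a, i_b, i_c):
    """Amplitude-invariant Clarke transform: three phases -> (alpha, beta)."""
    i_alpha = (2.0 * i_a - i_b - i_c) / 3.0
    i_beta = (i_b - i_c) / math.sqrt(3.0)
    return i_alpha, i_beta

def park(i_alpha, i_beta, theta):
    """Rotate (alpha, beta) into the (d, q) frame located at angle theta."""
    i_d = i_alpha * math.cos(theta) + i_beta * math.sin(theta)
    i_q = -i_alpha * math.sin(theta) + i_beta * math.cos(theta)
    return i_d, i_q

def inverse_park(u_d, u_q, theta):
    """Rotate (d, q) voltage commands back into the stationary (alpha, beta) frame."""
    u_alpha = u_d * math.cos(theta) - u_q * math.sin(theta)
    u_beta = u_d * math.sin(theta) + u_q * math.cos(theta)
    return u_alpha, u_beta
```

For balanced sinusoidal currents whose phase matches `theta`, the (d, q) components come out constant (here d = 1, q = 0 for unit amplitude), which is why simple PI controllers can regulate them.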


SVM uses a six-vector rosette (hexagon) in order to rebuild the stator voltage vector. The 
two adjacent rosette vectors are time-weighted within a sample period to produce 
the desired output voltage. In conclusion, the input of the SVM is the reference 
stator voltage vector and the outputs are the switching times to apply to each of the IGBT 
transistors of the inverter. The stator voltage vector is produced electrically by 
the SVM control technique and supplied to the AC motor with the desired phase 
voltages. 
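
The time weighting of the two adjacent vectors can be sketched with the standard dwell-time formulas for a two-level inverter in the linear modulation range. This is a generic illustration rather than the routine used in the experiment: the function name, the sector numbering from 0 to 5, and the parameters `v_dc` (DC-link voltage) and `t_s` (sample period) are assumptions.

```python
import math

def svm_dwell_times(u_alpha, u_beta, v_dc, t_s):
    """Return (sector, t1, t2, t0) for a reference voltage vector.

    sector is 0..5; t1 and t2 are the on-times of the two adjacent active
    vectors and t0 is the remaining zero-vector time within the period t_s.
    """
    magnitude = math.hypot(u_alpha, u_beta)
    angle = math.atan2(u_beta, u_alpha) % (2.0 * math.pi)
    sector = min(int(angle // (math.pi / 3.0)), 5)
    local = angle - sector * math.pi / 3.0      # angle measured inside the sector
    k = math.sqrt(3.0) * t_s * magnitude / v_dc
    t1 = k * math.sin(math.pi / 3.0 - local)
    t2 = k * math.sin(local)
    t0 = t_s - t1 - t2
    return sector, t1, t2, t0
```

A reference vector that coincides with an active vector of length 2/3 of `v_dc` uses that vector for the whole period (t1 = t_s, t2 = t0 = 0); shorter or rotated references split the period between the two neighbours and the zero vectors.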


[Block diagram: speed error e(t), reference current $i_{sq,ref}$, inverse Park transformation, vector selection, 3-phase inverter, AC motor] 

Fig. 2. PI speed controller in FOC AC drive. 
2.2 Spiking Speed Controller in FOC 


In order to test spiking signals in real-time experimentation, the classical PI 
speed controller [2-6] has been replaced by a spiking speed controller in the FOC 
structure (Fig. 3 and [8-13]). The controller receives the speed error e(t) as input and 
generates a spiking time series as reference stator current $i_{\iota,sq,ref}$, equivalent to a 
spiking reference electromagnetic torque. The topology of the spiking speed controller 
is inspired by human movement control [1]. The main characteristics 
and parameters are detailed below. 


— The controller uses the reciprocity principle, transforming a negative speed error 
signal e(t) into a positive spiking series $e_\iota(t)$ by means of different reciprocal groups of 
neurons (such as reciprocal muscles in the human arm). 

— The main proportional action loop has an accuracy that increases with the neuron 
sensitivity parameter a. 

— The controller uses a secondary loop, called the tremor, found in the human 
body. The tremor reflex loop permanently innervates the muscle in order 
to keep the arm bent in a steady-state target position. The neurons fire 
permanently and react to any small drift from the reference arm position. The 
neuron has a high sensitivity parameter b, which induces a fast firing rate of 
small spikes, increasing the controllability and the reactivity of the neuron. 
Moreover, an offset has been added in order to permanently fire a spiking 
control signal at steady-state speed, increasing the motor speed controllability. 

— In the human body, motoneurons innervate muscles, which act as low-pass filters, in order to 
create a continuous action signal. In the control topology of Fig. 3, we use the 
current control loop and the AC drive as the final low-pass filter organ, smoothing 
and integrating spike series. The resulting spiking action signal defines 
the spiking reference current $i_{\iota,sq,ref}$. Spikes are transformed into continuous 
speed $\omega(t)$ and electric current $i_{\iota,s}$ signals through the FOC drive topology 
in the same manner as a muscle does. 


The spiking speed controller in the FOC topology was coded using Code 
Composer Studio (CCS rev.5) and implemented on the C2000 
Delfino F28035 microcontroller. 

The High Voltage Digital Motor Control (DMC) kit from Texas Instruments 
(TI) [8-12] was used in order to control the 1.5 kW AC induction motor at 
1725 rpm nominal speed (Fig. 3) with a number of pole pairs p = 2. 


2.3 Experimental Results 


The experimental setup was used to gather experimental results plotted for the 
two presented FOC structures with a PI speed controller (Fig.2) or a spiking 
speed controller (Fig. 3). 


Experimental results are expressed in per-unit (pu) system. It means that 
presented numerical values are fractions of a defined base unit quantity (base). 

Figures 4 (speed) and 5 (electric q-current) present experimental results for 
the PI speed controller of Fig. 2. The PI speed controller has the following control 
parameters: 
Fig. 3. Spiking speed controller in FOC AC drive. 
Fig. 4. PI speed controller in FOC AC drive. Reference speed (black line 0); speed (red 
line 1) (Color figure online) 


— proportional gain $K_P = 1$ 
— integral gain $K_I = 0.04\ \mathrm{s}^{-1}$. 


Figures 6 (speed) and 7 (electric q-current) present experimental results for 
the spiking speed controller of Fig. 3. The spiking speed controller has the following 
control parameters: 


— proportional action loop parameter $a = 10^4$ 
— tremor action loop parameter b = 10° 
Fig. 5. PI speed controller in FOC AC drive. Reference q-current (black line 0); q-current 
(red line 1). (Color figure online) 
Fig. 6. Spiking speed controller ($a = 10^4$, b = 10°) in FOC AC drive. Reference speed 
(black line 0); speed (blue line 1). (Color figure online) 


Figure 7 depicts the spiking reference stator current $i_{\iota,sq,ref}$ composed of its 
proportional (black line 0) and tremor spiking series (grey line 1). The firing frequency 
decomposition of the spiking reference stator current is given in Fig. 8. It 
has to be emphasized that, despite the different reference stator current shapes 
Fig. 7. Spiking speed controller ($a = 10^4$, b = 10°) in FOC AC drive. Spiking reference 
stator current $i_{\iota,sq,ref}$ composed of proportional loop spikes (black line 0) and tremor 
loop spikes (grey line 1). q-current (blue line 2). (Color figure online) 
Fig. 8. Spiking speed controller ($a = 10^4$, b = 10°) in FOC AC drive. Firing frequency 
decomposition of the spiking reference stator current $i_{\iota,sq,ref}$ with proportional loop 
(black line 0) and tremor loop (grey line 1). 


of the PI solution (Fig. 5) and the spiking solution (Fig. 7), similar speed control 
performances (Figs. 4 and 6) and stator current shapes (Fig. 9) are observed. Dif- 
ferent stator current shapes in the FOC control structure will give different stator 
Fig. 9. q-current from spiking speed controller (blue line 0) and from PI controller (red 
line 1) (Color figure online) 
Fig. 10. Spiking speed controller in FOC AC drive. Reference speed (black line 0); 
Spiking controller with $a = 0.5 \times 10^4$, b = 0 (blue line 1); $a = 10^4$, b = 0 (red line 2) 
(Color figure online) 


voltage shapes applied to the motor by the SVM technique. However, performance 
remains equivalent whether we use a continuous signal (PI speed controller) or 
a spiking signal (spiking speed controller). These results confirm the signal equivalence 
between x(t) and $x_\iota(t)$ expressed in (2). 
Figure 10 (speeds) compares the control performances of the spiking speed 
controller for different parameters: 


— 1 (blue line) - Spiking speed controller with $a = 0.5 \times 10^4$, b = 0 
— 2 (red line) - Spiking speed controller with $a = 10^4$, b = 0 


Figure 10 shows that, even by increasing the parameter a, a spiking controller 
without tremor action b does not reach a zero steady-state error. However, in 
Fig. 6, after adding a tremor action, the speed quickly reaches the reference 
speed. 


3 Conclusion 


This paper applies and demonstrates in real-time experimentation the robustness 
of the spiking signal processing principle and formulas. Thanks to SSP theory, a clear 
and simple description of the spiking controller topology and parameters has been 
achieved. Experimental results show that spiking signals are able to achieve control 
performances comparable to those of classical continuous signals. Research is 
now open to applying firing rate decomposition of continuous signals to controller 
design and system identification. 
Abstract. We evolved spiking neural network controllers for simple 
animats, allowing for these networks to change topologies and weights 
during evolution. The animats’ task was to discern one correct pattern 
(emitted from target objects) amongst other different wrong patterns 
(emitted from distractor objects), by navigating towards targets and 
avoiding distractors in a 2D world. Patterns were emitted with variable 
silences between signals of the same pattern in the attempt of creating 
a state memory. We analyse the network that is able to accomplish the 
task perfectly for patterns consisting of two signals, with 4 interneurons, 
maintaining its state (although not infinitely) thanks to the recurrent 
connections. 


Keywords: Spiking neural networks · Temporal pattern recognition · 
Animat · Adaptive exponential integrate and fire 


1 Introduction 


Brains process information through generating and recognizing temporal activity 
patterns of neurons [1,3,5,7,8,11,14]. Neuronal spike trains carry information 
about the environment received through different modalities, including audition 
[10], olfaction [9], and vision [20]. Neurons perform temporal pattern recognition 
of sensory neuron activity in order to decode this information [3,6], which in 
turn requires temporal storage of stimuli or maintenance of an internal state 
[12, 16-19]. 

Several evolutionary neural-driven robotic models have attempted to repro- 
duce insect phonotaxis, that is, movement based on the temporal recognition of 
sound [4,13,15,21]. The abstract task explored in this and our previous paper 
[2] was inspired by phonotaxis in the sense that animats had to navigate towards 
target objects emitting one simple temporal pattern of signals (which could rep- 
resent sounds, scents, or flashes of light) and avoid distractor objects, which 
emitted other patterns. In contrast to the previous work [2], here we allowed the 
silences between letters of the same pattern to vary, hoping for the emergence 
of a state maintenance mechanism in evolution. We allowed all types of con- 
nections between interneurons and an unlimited size of the network (although a 
relatively small number of generations during artificial evolution did not allow 
the networks to grow very much). 

Our long-term goal is to understand how small networks can accomplish non- 
trivial computational tasks (here, control of foraging that depends on temporal 
pattern recognition) with robustness against noise (on input and/or state vari- 
ables) and damage (variation of the parameters of the environment, the animat, 
and/or neurons). 


2 The Model 


We used the platform GReaNs [22] to evolve networks of adaptive exponential 
integrate and fire neurons with tonic spiking, as we did previously [2]. In con- 
trast with our previous work [2], in the experiments described here, during both 
evolution and testing, (i) the objects did not reappear after collection, (ii) they 
were placed at a random position as before, but imposing a minimum and maxi- 
mum distance (10 and 45 times, respectively, the radius of the animat) from the 
starting point of the animat, (iii) the world was open (not toroidal as previously 
during evolution), (iv) the duration of silences between the signals of the same 
pattern was drawn from a Poisson distribution with λ = 30 ms (the length of a 
signal (letter) remained fixed, at 10 ms; the silences between the patterns also 
remained fixed, at 150 ms), (v) the intensity of signals remained constant for 
both signals of the pattern (this makes the task more difficult—as the distance, 
measured considering the position of the animat at the start of the pattern, 
changes as the animat moves—but simplifies the analysis of how the network 
solves the assigned task). As previously, the intensity of the signals encoded the 
direction to the source of the pattern, as $S_R/(S_L + S_R)$, where $S_R$ ($S_L$) is the 
distance between the source and a point on the right (left) side of the animat; thus 
if the source is on the left, the value is above 0.5, and if it is on the right, it 
is below 0.5. Also as previously, the patterns consisted of two signals; we will 
refer to patterns sent by the target as AB, while the distractor sends (wrong) 
patterns (ba, aa, bb). 
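
The timing of the emitted patterns, item (iv) above, can be sketched as an event generator. This is an illustration only, not code from GReaNs: the function names are hypothetical, silences are assumed to be whole milliseconds, and the Poisson sampler uses Knuth's multiplication method so that the sketch needs nothing beyond the standard library.

```python
import math
import random

def poisson(lam, rng=random):
    """Draw one Poisson-distributed integer (Knuth's multiplication method)."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def pattern_events(patterns, signal_ms=10, mean_gap_ms=30, pause_ms=150, rng=random):
    """Return (onset_ms, letter) events for a stream of two-letter patterns."""
    events, t = [], 0
    for first, second in patterns:
        events.append((t, first))
        t += signal_ms + poisson(mean_gap_ms, rng)   # variable silence inside a pattern
        events.append((t, second))
        t += signal_ms + pause_ms                    # fixed silence between patterns
    return events
```

Because the silence inside a pattern varies from one pattern to the next, a network cannot rely on a fixed delay between the two letters and has to keep some state after the first one.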

We carried out 200 independent runs (population size 300, size-2 tournament 
selection, no elitism), aiming to minimise the fitness function: 


\[ f_{fitness} = 1 - \left( \frac{T - 2D}{N} + c\,\frac{\alpha}{\beta} \right), \tag{1} \]


where T is the number of targets collected; D is the number of distractors collected; 
N is the total number of targets that can be collected if the animat moves 
at maximum speed (which was 1 target during evolution); c is 0 for the first 100 
generations and 0.5 for the remaining generations; α is the length of a straight line 
connecting the start position of the animat to the position of the target and β is 
the length of the path made by the animat from the start position to the target 
(if it is collected) or to the last position of the animat during simulation. This 
reward term promotes directional movement toward the target rather than the circular 
motion at top speed often seen when this term was omitted (such circular 
motion allows the animat to hit the target without directional movement). Since only one 
target and one distractor were present during evolution, the lowest (best) possible 
value for this fitness function was −0.5 and the highest (worst) was 3. 
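
The stated bounds can be checked with a short sketch of the fitness function, read here as 1 − (T − 2D)/N − c·α/β, which is the form consistent with the best value of −0.5 and the worst value of 3. The function name, the zero-based generation index, and the guard for a zero-length path are assumptions made for this illustration.

```python
def fitness(T, D, N, alpha, beta, generation):
    """Fitness to be minimised: 1 - (T - 2D)/N - c * alpha/beta."""
    # The path-efficiency term is switched on after the first 100 generations.
    c = 0.0 if generation < 100 else 0.5
    # Guard against an animat that never moved (assumption, not from the paper).
    path_ratio = alpha / beta if beta > 0 else 0.0
    return 1.0 - (T - 2 * D) / N - c * path_ratio

best = fitness(T=1, D=0, N=1, alpha=10.0, beta=10.0, generation=500)
worst = fitness(T=0, D=1, N=1, alpha=10.0, beta=1e9, generation=500)
```

With one target and one distractor, collecting the target along a straight line gives the best value, while hitting only the distractor after a very long path approaches the worst value.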

The mutation parameters used here are the same as in [2], but with a differ- 
ence in the rate of duplication (0.002 per genome instead of 0.001 in our previous 
work) and the rate of deletion (0.001, instead of previously 0.0001). Although 
the duplication rate was twice the deletion rate, the networks did not grow large 
even though (in contrast to [2]) we allowed for an unlimited number of nodes in 
the network. 

Each evolutionary run had 2500 generations; if an animat with fitness below 
−0.15 was detected, the run was stopped after an additional 50 generations. In 
each generation, we evaluated first over 15 random worlds (worlds with different 
positions of objects and orientation of the animat) and 200 patterns for each 
world. Each target emitted pattern AB, and each distractor emitted a mixture 
of the other (wrong) patterns (with equal probability): aa, ba, and bb. The 
percentage of the correct pattern occurrence for these 15 evaluations ($per_{AB}$) 
was 30%. Then the animat was evaluated over 3 additional worlds, with $per_{AB}$ = 
50%, but this time with the distractor always emitting a specific wrong pattern 
(for example, only aa). Finally, the 18 values of the fitness function obtained for 
the 18 maps were averaged. 


3 Results and Discussion 


3.1 The Efficiency to Discern Correct Pattern 


Out of 200 runs, 20 ended with fitness < −0.25. The 20 champions were ranked 
by their ability to collect targets and avoid distractors. The best champion out 
of the 20 (the winner) was a perfect recogniser such that T = 1000 in 1000 maps 
while D = 0. The winner had 4 interneurons, and showed a directional walk 
without circling (Fig.1), which made it amenable to analysis. The networks’ 
sizes of the remaining 19 champions ranged from 3 to 12 interneurons. In the 
rest of the paper, we will focus on the winner and analyse its performance and 
the underlying mechanism. 

When the winner was tested for various frequencies of the correct pattern, we 
kept the number of occurrences of ABs constant, at 60. Even at 1% frequency 
of the correct pattern, the winner collected the target in 700 worlds out of 1000, 
while it could collect between 960 and 1000 for percentages greater than 10%. 
Thus, even for large numbers of wrong patterns, the task could still be achieved 
with high precision. 

The winner was also tested for a wide range of durations of silent interval 
between letters (Fig. 2; in these tests, unlike during evolution, the duration of 


Fig. 1. Visualisation of the performance of the winner. The test world has 6 targets 
(black circles) and 6 distractors (red squares), the black swiping line is the movement 
trajectory of the animat. (Color figure online) 


the interval was kept constant). When this interval was between 17 and 37 ms (close 
to the mean of the Poisson distribution that was used during evolution, 30 ms), 
the animat behaved almost perfectly. Longer durations of silence between letters 
led to an abrupt drop in performance, and no targets were collected for silences of 
45 ms and above. The loss of performance for values smaller than 17 ms was 
mainly due to whether or not a spike coming from neuron NO coincided with 
another spike from neuron N1 (see Fig.3 and Sect. 3.3), for a target on the left. 
Such a spike coincidence is necessary to trigger neuron N2 to spike, which in 
turn makes the animat turn left and thus to forage in a swiping fashion: going 
left-right-left-..., but keeping an overall direction towards the target (Fig. 1). 
On the other hand, when there was no silence at all between letters of the same 
pattern, the spike coincidence allowed the animat to still collect 346 targets. 

Furthermore, long silences between patterns did not affect the performance of 
the animat. Even when they were 1000 ms long, the animat still collected many 
more targets than distractors (the ratio between the average number of targets T 
and distractors D hit over 1000 maps, T/D, was above 50). However, the number 
of collected targets decreased for silences below 40 ms, yet no distractor was hit 
during this test. Any decrease in the length of a stimulus (less than 10 ms) led 
to a circling movement around the starting point or to no movement at all. 
On the other hand, an increase to up to 13ms allowed for a fair recognition 
(T/D = 997/128 = 7.79). 


3.2 Robustness of the Winner 


When we increased the number of objects equally for targets and distractors 
(up to 10 each), T/D remained within the range 12-18. In order to test the 
Fig. 2. The number of targets collected by the winner over 1000 random worlds, for 
various durations of the silent interval between letters within the same pattern. 


efficiency of the animat in foraging one target while avoiding a large number 
of distractors, we set the number of targets to 1 and distractors to 20 and 
evaluated the animat's performance with 700 patterns (140 s simulation time) 
on 1000 worlds. The animat collected 999 targets out of a total of 1000, and only 
170 distractors out of a total of 20000. 

We then investigated the robustness of the behavior to changes in the param- 
eters of the animat. The winner showed robustness to changes in actuator forces; 
a two-fold increase of both actuator forces resulted in T/D = 1000/44 = 22.73. 
Surprisingly, even when we increased the forces 33 times, the animat could still 
maintain its discrimination ability (T/D = 889/85 = 10.46). 

Furthermore, we investigated the ability of the animat to recognise patterns 
with a larger number of signals in a pattern. We tested AAB being sent from 
the target, inter-spaced with wrong patterns (ba, aa, bb). This gave T/D < 1. 
In contrast, for all patterns that started with A, followed by any number of Bs, 
the animat showed the ability to perform pattern recognition such that T/D 
was between 6 and 19. This could be explained by the way the network of the 
winner solves this task (Sect. 3.3). 

Next, we perturbed the inhibitory gain of synapses, and obtained T/D = 
1000/77 = 12.99 for a gain value of 0.5 nS (6 times less than the default value, 3 
nS), whereas for 6 nS (2-fold increase) the quotient was 9.9 (T = 1000, D = 101). 
However, any slight change in the excitatory synaptic gain resulted in T/D < 1. 

The controller of the winner was not robust to any change of neuronal parameters
apart from the change of V_r. When V_r (reset voltage) was changed to
a value 3 mV less/more than the default value (from −58 mV to −55 mV or
−61 mV) for all neurons in the network, the winner showed a small but observable
preference for correct patterns over wrong patterns by collecting more than
twice as many targets as distractors (T/D = 2.65).


3.3 Analysis of the Network


The key mechanism that causes the animat to adapt the direction towards the 
target while navigating is the sustained activation of interneuron N2 (see Fig. 3). 
The network holds a state that indicates A was received by sustained (thanks to
a self-loop) firing of N0. When the network in such a state receives B from the
left (the input is at a high level), first N1 spikes once, which together with the
activity of N0 causes sustained (again, thanks to a self-loop) spiking of N2. AL
can only spike thanks to the sustained spiking of N0 and a spike from N3. This
N3 spike is triggered by SB (the input to the network that presents the signal B). 
The animat turns left only in this situation, because AR receives an excitatory 
connection only from N2. AR is the output (motor) neuron that controls the 
right actuator, so if its spikes outnumber AL spikes, the animat turns left. On 
the other hand, when the pattern is received from the right (the level of both 
inputs is low), N2 remains quiescent, and so does AR. N2 does not spike in this 
case—it does not receive any spike from N1, because the input is too low to 
cause N1 to spike. 

To sum up, when the animat receives the pattern AB from the left side, it 
keeps turning left until the position of the animat changes enough for it to be 
in a location where the pattern is received from the right side, then it turns 
right, until again the signals in the correct pattern are received from the left. 
The alternation of these two movements results in an overall navigation towards 
the target, while ignoring all wrong patterns. Though the network can be in the 
state indicating A was received, wait for B, it does not produce a different action 
when more than one B is received after A (this scenario was not encountered 
during evolution). 

The response of the network to the wrong stimuli (Fig. 3) agrees with the
explanation above. Motor neurons are active only for one wrong pattern, aa. N0
starts sustained spiking when a is received (SA active), making AL spike and
thus the animat turns right. N2 is silent because of the absence of activity in
SB, therefore AR does not spike. So whether the distractor is placed on the left
or on the right, the animat turns in a clockwise circle.

For ba, N2 does not spike, despite sustained spiking of N0 and (in the case
of high stimulation from inputs, i.e. the source is on the left) N1 and N2 spiking
once each. The reason is that when SB activation precedes SA activation, N1
spikes before N0, and thus N2 cannot spike. When the source is on the right,
N1 cannot spike due to low stimulation from SB, and the final output behavior
is similar to when the source is on the left. For bb, the absence of SA activity
causes N0 to be quiescent, thus neither of the two actuators can spike.

We then investigated the network responses to the target stimulus AB when 
presented with different sensory input levels. For a very low input level, there is 
no activity of interneurons and output neurons. The animat turns right only for 
input levels <0.5 (that is, when the target is on the right). Greater values result 
in a larger number of spikes in AR than in AL, thus causing a left turn of the 
animat. The activity of interneurons for different input level values is similar to
the activity shown in Fig. 3. N0 is the only interneuron that starts firing after
receiving an activation from SA for higher values, and sustains activity until B
is received. This is followed by the activity of neuron N3, right after receiving
B, and only then AL spikes. N2 is triggered after a high rate spiking activity of
N0, and for high values of input no longer requires the coincidence
of N0 and N1 spikes shown in Fig. 3 (Fig. 4). N2 starts spiking before N1 due to
the stimulation from SB. As long as N2 spikes, AR spikes as well, at a higher
rate for higher input values, hence outnumbering the spikes fired by
AL and pushing the animat to turn left when the source is placed on the left.

For the other wrong stimuli, similar output activities as in Fig. 3 can be observed
for a wider spectrum of input levels; for example, aa results in only one actuator
(AL) being active, regardless of source location (Fig. 4).

Fig. 3. The network topology of the winner and spike trains of the four patterns for
two-level sensory inputs. Excitatory links are shown with arrow heads (orange), inhibitory
are bar-headed (cyan). The voltage traces (vertical axes: voltage in mV, horizontal
axes: time in ms) in green are the responses for signals coming from the right, red
traces for signals coming from the left. (Color figure online)

Fig. 4. The response of the network to the correct pattern (AB) and to one wrong
pattern (aa). The raster plots are presented separately for interneurons and outputs.
The sensory nodes are active during the time slots shaded in gray, at a constant level
over a full range (vertical axis: gray horizontal lines are at the level 0.45 and 0.55, red
line at 0.5; horizontal axis: simulation time covering all spikes of one stimulus). (Color
figure online)


4 Conclusions and Future Work


We have evolved a simple simulated robot governed by a very small neural 
network, which is capable of achieving a simple yet non-trivial temporal pattern 
recognition task while foraging in a 2D world. The evolved network showed a 
maintenance of state for a finite time, which was based on recurrent connections 
within the network. Although the network was robust against variation of silences 
between letters and between patterns during the simulation of the animat, it 
was not robust against changes in neuronal parameters. In the future, in order 
to explore the robustness of such small networks, we plan to evolve the network 
in the presence of voltage noise. Preliminary results (not covered in this paper) 
show that such networks are much more robust to alterations of some of the 
neuronal parameters.


Abstract. Multiplicative or divisive changes in tuning curves of individual
neurons to one stimulus (“input”) as another stimulus (“modulation”)
is applied, called gain modulation, play an important role in
perception and decision making. Since the presence of modulatory synap- 
tic stimulation results in a multiplicative operation by proportionally 
changing the neuronal input-output relationship, such a change affects 
the sensitivity of the neuron but not its selectivity. Multiplicative gain 
modulation has commonly been studied at the level of single neurons. 
Much less is known about arithmetic operations at the network level. In 
this work we have evolved small networks of spiking neurons in which 
the output neurons respond to input with non-linear tuning curves that 
exhibit gain modulation—the best network showed an over 3-fold multi- 
plicative response to modulation. Interestingly, we have also obtained a 
network with only 2 interneurons showing an over 2-fold response. 


Keywords: Gain modulation · Multiplicative operation ·
Spiking neural network · Artificial evolution ·
Adaptive exponential integrate and fire


1 Introduction 


Multiplicative or divisive changes in a tuning curve of individual neurons to one 
stimulus (here, input) as another stimulus (here, modulation) is applied, called 
gain modulation, are thought to play an important role in neural computation 
[6,7,10,12]. Gain modulation has been observed in the neurons responsible for 
keeping stable course during flight in domestic flies [4], and in the auditory sys- 
tem of owls [9] and crickets [5]. In the mammalian brain, neurons in cortical and 
subcortical regions vary their output response in a multiplicative fashion relative 
to a background modulatory synaptic input [7,10]. In this scenario, information
is aggregated from different stimuli and the output response is modulated so
that a change in the slope of input-output firing rate is produced. The control of
the movement of the eyes and hands, the perception of visual information, and
motor memory are other examples where gain modulation was observed [11].

Fig. 1. Target tuning curves. The curves, generated with Eq. 5, show the expected
number of spikes of the output neuron for different levels of input and different levels
of modulation, which correspond to different values of the multiplier m.

Since gain modulation is an operation observed in single biological neurons, 
previous studies on modeling such multiplicative operations focused on models 
of single neurons (e.g., [1,3]). Here, we take a different approach—we evolve 
a network of simple neurons (adaptive exponential integrate and fire neurons 
[2]) to perform multiplicative operations inspired by the operations performed 
by single neurons—the fitness function in artificial evolution rewarded the net- 
works that had a non-linear response (firing rate of the output neuron) to input, 
varying proportionally to different levels of modulation (and therefore matching 
the tuning curves in Fig. 1). 


2 The Model 


The adaptive exponential integrate and fire [2] neuronal model has four state 
variables and a number of parameters (Table 1; the parameters we used result 
in tonic spiking for constant input [8]): 


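C \frac{dV}{dt} = g_I (E_I - V) + g_E (E_E - V) + g_L (E_L - V) + g_L \Delta_T \exp\left(\frac{V - V_T}{\Delta_T}\right) - w    (1)

\tau_w \frac{dw}{dt} = a (V - E_L) - w    (2)

\frac{dg_E}{dt} = -\frac{g_E}{\tau_E}    (3)

\frac{dg_I}{dt} = -\frac{g_I}{\tau_I}    (4)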

Table 1. Neuronal parameters used in this paper

Parameter | Description                         | Value
τ_E/I     | excitatory/inhibitory time constant | 5 ms
G_E/I     | excitatory/inhibitory synaptic gain | 3 nS
g_L       | total leak conductance              | 10 nS
E_L       | effective rest potential            | −70 mV
E_I       | inhibitory reversal potential       | −70 mV
E_E       | excitatory reversal potential       | 0 mV
Δ_T       | threshold slope factor              | 2 mV
V_T       | effective threshold potential       | −50 mV
C         | total capacitance                   | 0.2 nF
a         | adaptation conductance              | 2 nS
b         | spike-triggered adaptation          | 0 pA
τ_w       | adaptation time constant            | 30 ms
V_r       | reset voltage                       | −58 mV
V_th      | spike detection threshold           | 0 mV


A spike is generated when the voltage (V) crosses a threshold (V > V_th);
when this happens V takes a value V_r, and the adaptation (w) takes a value
w + b. When a neuron receives a spike from a neuron that connects to it (a
presynaptic neuron), the excitatory (inhibitory) conductance is increased by an 
excitatory (inhibitory) gain multiplied by the synaptic weight. In this paper, we 
use Euler integration with 1 ms time steps. 
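
The update rule described above can be illustrated with a short Python sketch of one Euler step. This is only a sketch under stated assumptions, not the code used in GReaNs: the parameter names, the use of SI units, and the clamp on the exponential term (added here to avoid numerical overflow with the coarse 1 ms step) are our own choices.

```python
import math

# Default parameter values from Table 1, in SI units (names are illustrative).
P = dict(C=0.2e-9, g_L=10e-9, E_L=-70e-3, E_E=0.0, E_I=-70e-3,
         Delta_T=2e-3, V_T=-50e-3, a=2e-9, b=0.0, tau_w=30e-3,
         tau_E=5e-3, tau_I=5e-3, V_r=-58e-3, V_th=0.0)

def euler_step(state, p=P, dt=1e-3):
    """Advance (V, w, g_E, g_I) by one Euler step; return (new_state, spiked)."""
    V, w, g_E, g_I = state
    # Assumption: clamp the exponent so that exp() cannot overflow.
    expo = math.exp(min((V - p["V_T"]) / p["Delta_T"], 20.0))
    dV = (g_E * (p["E_E"] - V) + g_I * (p["E_I"] - V)
          + p["g_L"] * (p["E_L"] - V)
          + p["g_L"] * p["Delta_T"] * expo - w) / p["C"]
    dw = (p["a"] * (V - p["E_L"]) - w) / p["tau_w"]
    V, w = V + dt * dV, w + dt * dw
    # Conductances decay exponentially between incoming spikes.
    g_E -= dt * g_E / p["tau_E"]
    g_I -= dt * g_I / p["tau_I"]
    spiked = V > p["V_th"]
    if spiked:
        V, w = p["V_r"], w + p["b"]   # reset voltage, increment adaptation
    return (V, w, g_E, g_I), spiked
```

Synaptic input would be applied between steps by increasing g_E or g_I by the gain multiplied by the connection weight.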

Each network, in addition to interneurons and the output neuron (all neurons 
had the same parameters), has two nodes: input and modulation nodes, which 
can be connected to interneurons but not to the output neuron. If an interneu- 
ron receives an excitatory (inhibitory) connection from the input or modulation 
node, at each time step the interneuron’s excitatory (inhibitory) conductance 
is increased by the value of the input or modulation (a value between 0 and 1) 
multiplied by the excitatory (inhibitory) synaptic gain and the weight of the 
connection. In other words, each interneuron can receive stimulation (excitatory 
or inhibitory) from other interneurons (or, indeed, itself), and input or modula- 
tion nodes. But whereas each spike from a neuron in the network results in an 
increase of the excitatory or inhibitory conductance by a value proportional to 
the synaptic weight, the increase in the case of the stimulation from input or 
modulation nodes is also proportional to input or modulation. 

Both the input and modulation are presented for 240 ms, and the network
response (the spikes of the output neuron) is measured at the same time. To
prevent the response to one pair (input, modulation) from affecting the
response to another pair, all the neurons in the network are reset to their initial
state (V = E_L, w = 0, g_E = 0, g_I = 0) after each pair is presented to the
network.

We used the platform GReaNs [13,14] to evolve the networks. In GReaNs,
networks are encoded in linear genomes (Fig. 2); the encoding is inspired by 
the way biological genetic networks are encoded in biological genomes [14]. In 
principle, the number of interneurons encoded in the network and the number 
of links between them is unlimited (here, for reasons of computational efficiency, 
we limited the number of interneurons to 10; in practice, the networks did not 
grow during evolution beyond 7 interneurons). Each interneuron is encoded by 
a series of cis and trans genetic elements; input, modulation, and output nodes 
are encoded by a genetic element each. Each genetic element has an associated 
point in an abstract 2-dimensional affinity space. To determine the connectivity 
of the network, first the Euclidean distance between each ordered pair of points 
(trans, cis), (input, cis), (modulation, cis), and (trans, output) is obtained. This 
distance translates to a contribution to the weight of the connection between 
nodes in the network. Since each interneuron can be encoded by several cis 
and trans elements, the weight contributions are summed. If the distance is 
above a certain threshold, the contribution is zero. Otherwise, the contribution 
is an inverse exponential function of the distance. The sign of the contribution 
(positive or negative) is determined by the sign associated with each genetic 
element—if both signs are the same, the contribution is positive, otherwise it 
is negative. A positive (negative) sum of contributions results in an excitatory 
(inhibitory) link in the network. 
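
The mapping from affinity-space distances to connection weights can be sketched as follows. The text specifies only that the contribution is zero above a threshold and otherwise an inverse exponential function of the distance; the concrete form exp(-scale * d) and the values of `threshold` and `scale` below are assumptions made purely for illustration, and the function names are hypothetical.

```python
import math

def contribution(p, q, sign_p, sign_q, threshold=5.0, scale=1.0):
    """Weight contribution of one pair of genetic elements at points p and q."""
    d = math.dist(p, q)                 # Euclidean distance in affinity space
    if d > threshold:
        return 0.0
    magnitude = math.exp(-scale * d)    # assumed form of "inverse exponential"
    # Equal signs give a positive contribution, different signs a negative one.
    return magnitude if sign_p == sign_q else -magnitude

def link_weight(pairs, **kw):
    """Sum the contributions of all element pairs linking two nodes."""
    total = sum(contribution(*pair, **kw) for pair in pairs)
    kind = "excitatory" if total > 0 else "inhibitory" if total < 0 else "none"
    return total, kind
```

For example, `link_weight([((0, 0), (0, 1), 1, -1)])` describes a single pair of elements with opposite signs at distance 1, which yields an inhibitory link.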

Each independent run of artificial evolution was limited to 2000 generations, 
with a constant population (300 individuals), elitism (10), and size 2 tourna- 
ment selection. The initial population consisted of random genomes created as 
described previously [15], and the genetic operators were exactly the same as in 
this previous work [15]. 

Fig. 2. The encoding of a network in a linear genome. Genetic elements I and M code for
the nodes that allow for presenting the stimulation by input and modulation, O encodes
the output neuron. See text for further details.

Fig. 3. Absolute error of the evolved networks. Both the network with six interneurons
(champion 1, left) and the network with two (champion 2, right) show an absolute error
of at most about 2 when compared to the best-fit tuning curves across 8 modulation
levels experienced during evolution (filled symbols) and 6 intermediate values (empty
symbols) used only in testing.

The fitness function rewarded the correct number of spikes of the output
neuron in response to a given pair of input and modulation. In other words,
the networks were evolved to match target tuning curves (Fig. 1), which were
generated using the equation


T(I, m) = m \times \frac{35}{1 + \exp(-8I + 4)},    (5)


where I is the input and m is the multiplier, expected to be twice the modulation
during evolution (for example, modulation of 0.2 is expected to produce the output
corresponding to m = 0.4). The constant parameters in Eq. 5 were selected by
hand to give a clearly non-linear response (similar to the tuning curves observed
in [3]) with biologically realistic firing rates, and so that the network with a
perfect response would show a 5-fold multiplicative response (this is the ratio of
the highest multiplier divided by the lowest).
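
As a quick check of Eq. 5, the target tuning curve can be written as a small Python function. The function name is ours; the constants are those of Eq. 5, and the multiplier is taken as twice the modulation, as stated in the text.

```python
import math

def target_spikes(I, m):
    """Expected number of output spikes for input I and multiplier m (Eq. 5)."""
    return m * 35.0 / (1.0 + math.exp(-8.0 * I + 4.0))

# Multipliers for the lowest and highest modulation used during evolution.
m_low, m_high = 2 * 0.2, 2 * 1.0
ratio = target_spikes(0.5, m_high) / target_spikes(0.5, m_low)
```

At I = 0.5 the sigmoid is at its midpoint, so the target is 17.5 m spikes, and the ratio between the highest and lowest multipliers is 2.0/0.4 = 5, the 5-fold response mentioned above.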

The fitness function minimises the sum of absolute differences between the 
target and observed responses (absolute error), averaged over all np pairs of 
input and multiplier: 


f_{fitness} = \frac{1}{np} \sum_{k=1}^{n} \sum_{l=1}^{p} \left| T(I_k, m_l) - O(I_k, m_l) \right|    (6)


During evolution we presented 209 pairs of input and modulation, n = 19 
levels of input, from 0.1 to 1.0, 0.05 apart, times p = 11 levels of modulation, 
from 0.2 to 0.8, 0.1 apart, and from 0.85 to 1, 0.05 apart. 
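
The evaluation grid and the error of Eq. 6 can be sketched in Python as below. The callable `observe(I, M)`, which stands for the number of output spikes a network produces for a given input and modulation, is a hypothetical placeholder and not part of GReaNs.

```python
import math

def target_spikes(I, m):
    """Target tuning curve of Eq. 5."""
    return m * 35.0 / (1.0 + math.exp(-8.0 * I + 4.0))

# 19 input levels (0.1 to 1.0, 0.05 apart) and 11 modulation levels.
inputs = [round(0.1 + 0.05 * k, 2) for k in range(19)]
modulations = [round(0.2 + 0.1 * l, 2) for l in range(7)] + [0.85, 0.9, 0.95, 1.0]

def fitness(observe):
    """Mean absolute error over all (input, modulation) pairs (Eq. 6)."""
    total = 0.0
    for I in inputs:
        for M in modulations:
            m = 2.0 * M   # expected multiplier is twice the modulation
            total += abs(target_spikes(I, m) - observe(I, M))
    return total / (len(inputs) * len(modulations))
```

The two lists give 19 × 11 = 209 pairs, matching the number stated in the text; a network that reproduces the targets exactly has a fitness of 0.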


3 Results and Discussion 


We performed 100 independent runs of artificial evolution, and then analysed the
responses of the champion networks. During this analysis, we noticed that
none of the champions shows good responses for input values above 0.8 and
modulation levels above 0.85. We will aim to resolve this problem in our future
work. For the preliminary analysis in this paper, we have thus retested all the
champions for 15 input values (from 0.1 to 0.8, 0.05 apart) and 14 modulation
values, 8 presented during evolution (from 0.2 to 0.8, 0.1 apart, plus 0.85) and
6 additional values from 0.25 to 0.75, 0.1 apart.

Since we are interested in evolving networks for multiplicative operations in
general, not networks whose responses match a particular set of tuning curves,
we fitted (using least squares as the goodness-of-fit measure) the parameter m
in Eq. 5 to the actual responses for these 14 levels of modulation, each over
these 15 levels of input. Only two networks out of 100 had an absolute error
(averaged for all input levels for a given modulation level) of around 2 (Fig. 3)
for all 14 modulation levels, that is, at most about 2 spikes difference from the
best-fit tuning curve. One of these networks (champion 1) had six interneurons, the
other (champion 2) had two: the genome coded for three, but one of the neurons did
not spike for any (input, modulation) pair and thus could be removed.

The ratio of the highest to the lowest m value (m_max/m_min) fitted as
described above for each network gives the maximum factor by which the
network actually multiplies its response to the input as the modulation varies.
This ratio was 3.4 for champion 1 (the fitted value of m was m_min = 0.54 for
modulation = 0.2 and m_max = 1.82 for modulation = 0.8) and 2.3 for champion
2 (m_min = 0.74, m_max = 1.71). Thus the larger network showed a larger
multiplicative response to modulation (Fig. 4). Since in 100 runs we obtained only
two networks with small absolute errors, we cannot judge yet if in general larger
networks thus obtained will show larger multiplicative responses. We plan to
investigate this in our future work.
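
Because Eq. 5 is linear in m, the least-squares fit of m for one modulation level has a closed form, m = Σ O_k s_k / Σ s_k^2, where s_k is the value of Eq. 5 for m = 1 at input I_k and O_k is the observed response. The sketch below illustrates this; the text does not say how the fit was actually computed, so the closed-form approach and the function names are assumptions.

```python
import math

def shape(I):
    """Eq. 5 evaluated for m = 1."""
    return 35.0 / (1.0 + math.exp(-8.0 * I + 4.0))

def fit_m(inputs, responses):
    """Least-squares estimate of m for responses modelled as m * shape(I)."""
    num = sum(o * shape(i) for i, o in zip(inputs, responses))
    den = sum(shape(i) ** 2 for i in inputs)
    return num / den

# The 15 input levels used in testing (0.1 to 0.8, 0.05 apart).
test_inputs = [0.1 + 0.05 * k for k in range(15)]

# Ratio of the fitted multipliers reported for champion 1.
ratio_champion_1 = 1.82 / 0.54
```

With the reported values, 1.82/0.54 ≈ 3.37 and 1.71/0.74 ≈ 2.31, which round to the 3.4-fold and 2.3-fold responses given in the text.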

The networks of both champion 1 and 2 had only excitatory connections.
Perhaps if we reformulated the task so that the networks were not reset after
each (input, modulation) pair, inhibitory connections would also be necessary.
This is another issue we plan to investigate in our future work.

The preliminary analysis of the activity in the smaller network (Fig. 5) shows
that while the firing rate of interneuron N1 is well above 100 Hz, the firing rate
of N0 is lower, around 30 Hz, and N0 does not respond to low modulation and
input levels. The activity of N1 is very similar to the activity of the output;
the most noticeable exceptions are in the response to high levels of input, which
in N1, unlike in output, does not change much with the varying modulation.
We can speculate that the connection from N0 to N1 together with a
self-connection of N1 is what allows the varied response to high input as the
modulation varies.

Fig. 4. The best-fit tuning curves of the evolved networks for three levels of modulation.
The tuning curves correspond to the best fit of m (see text for details), for the modulation
at the lowest (0.2), intermediate (0.6) and the highest level (0.8). The network with six
interneurons (champion 1, left) shows a 3.4-fold multiplicative response, the network
with two interneurons (champion 2, right) shows a 2.3-fold response.

Fig. 5. The topology of the network with two interneurons with the responses of the
output neuron and the interneurons.


4 Conclusions and Future Work


We have successfully evolved a network of six plus one (interneurons plus output)
adaptive exponential neurons with a tuning curve that can be scaled multiplicatively
3.4 times, and a network with only two plus one such neurons with a
curve scalable 2.3-fold. In future work, our focus will be on the absolute error in
the network response, so that we can scale up to higher multiplicative values.
We also plan to investigate if networks performing multiplicative operations can
be evolved in the presence of noise on the input and modulation and on the
state variables of the neurons. We would like to test the hypothesis that larger
networks allow for better matching of tuning curves and higher multiplicative
responses. We would also like to reformulate our model so that not only the firing
rates of the neurons in the network are kept within biologically realistic values,
but also the currents resulting from the stimulation. This was not the case
here, as the synaptic weights in the evolved networks reached very high values;
we plan to limit them by introducing an additional sigmoidal transformation in
the encoding of synaptic weights in our model.


Abstract. We evolve both topology and synaptic weights of very small
recurrent spiking neural networks in the presence of noise on the membrane
potential. The noise is at a level similar to the level observed in
biological neurons. The task of the networks is to recognise three signals 
in a particular order (a pattern ABC) in a continuous input stream in 
which each signal occurs with the same probability. The networks con- 
sist of adaptive exponential integrate and fire neurons and are limited to 
either three or four interneurons and one output neuron, with recurrent 
and self-connections allowed only for interneurons. Our results show that 
spiking neural networks evolved in the presence of noise are robust to 
the change of neuronal parameters. We propose a procedure to approx- 
imate the range, specific for every neuronal parameter, from which the 
parameters can be sampled to preserve, at least for some networks, high 
true positive rate and low false discovery rate. After assigning the state 
of neurons to states of the network corresponding to states in a finite 
state transducer, we show that this simple but not trivial computational 
task of temporal pattern recognition can be accomplished in a variety of 
ways. 


Keywords: Temporal pattern recognition · Spiking neural networks ·
Artificial evolution · Minimal cognition · Complex networks ·
Genetic algorithm · Finite state automaton · Finite state machine


1 Introduction 


Information in biological neuronal systems is represented temporally by precise
timing of voltage spikes [1,3,5,6,12,13,15]. Thus noise poses a fundamental
problem for information processing in biological systems [9] (and also artificial
systems inspired by them). On the other hand, noise has been postulated to play
a computational role [14]. For example, neuronal noise enables the phenomenon
of stochastic resonance in neural networks, a process in which a weak signal
gets amplified to reach a threshold, or a strong signal is prevented from spiking
[7,20,21]. Moreover, neural networks formed in the presence of background noisy
synaptic activity can be expected to be robust to disturbances [11].

In this work, we analyse very small spiking neural networks (SNNs) evolved to 
perform a simple temporal pattern recognition task in the presence of noise. We 
will show that networks evolved with noise maintain functionality even when the 
parameters of the neuronal model are changed. In contrast to our previous work 
[23] in which just one neuronal parameter was varied at any given time (while 
all the other parameters were kept at the default value), here we investigate 
the robustness against varying all the parameters simultaneously. Although the 
model for evolving the topology and weights in the SNNs we use here does not 
in principle limit the number of neurons, we limited this number to either three 
or four interneurons and one output neuron. 

It has been observed before that the same computational task can be accom- 
plished by networks with different structures [16,19]. Our long-term goal is 
to understand how various solutions—obtained by evolving networks numer- 
ous times, independently—can accomplish simple, but not trivial computational 
tasks. 


2 The Model 


The networks in this work consist of adaptive exponential integrate and fire 
neurons [17] with the default values of the parameters that result in tonic spiking 
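for constant input. The four state variables of each neuron, membrane potential
V, adaptation w, excitatory and inhibitory conductance g_E and g_I, are governed
by the equations


\frac{dV}{dt} = \frac{1}{C} \left( g_E (E_E - V) + g_I (E_I - V) - w \right) + \frac{1}{\tau_m} \left( E_L - V + \Delta_T \exp\left(\frac{V - V_T}{\Delta_T}\right) \right)    (1)

\tau_w \frac{dw}{dt} = a (V - E_L) - w    (2)

\frac{dg_E}{dt} = -\frac{g_E}{\tau_E}    (3)

\frac{dg_I}{dt} = -\frac{g_I}{\tau_I}    (4)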


with 13 parameters in total; the default values of parameters are presented in 
Table 1 [23,24]. We used Euler integration with a 1 ms time step, and added a
random value drawn from the normal distribution centered at 0 with standard 
deviation 2 mV to V at each step; this level of noise is similar to that observed 
in biological neurons [2,8,10,18]. 

Table 1. The ranges of robustness for champions with 3 and 4 interneurons that
were most robust (3/3 and 1/4, respectively) overall and for the most robust among the
champions maintaining state (8/3 and 7/4).

Parameter | Default value | 3/3          | 8/3          | 1/4          | 7/4
E_L       | −70 mV        | [−72, −67]   | [−72, −66]   | [−72, −67]   | [−74, −68]
V_r       | −58 mV        | [−59, −55]   | [−60, −54]   | [−60, −55]   | [−59, −55]
V_T       | −50 mV        | [−51, −48]   | [−51, −48]   | [−52, −49]   | [−51, −48]
Δ_T       | 2 mV          | [1.6, 2.4]   | [1.8, 2.3]   | [1.8, 2.1]   | [1.9, 2.2]
C         | 0.2 nF        | [0.19, 0.22] | [0.17, 0.23] | [0.17, 0.21] | [0.17, 0.22]
a         | 2 nS          | [−2, 4]      | [1, 6]       | [0, 3]       | [1, 4]
b         | 0 pA          | [0, 3]       | [0, 4]       | [0, 3]       | [0, 2]
τ_m       | 20 ms         | [19, 22]     | [18, 23]     | [17, 21]     | [17, 23]
τ_w       | 30 ms         | [29, 32]     | [29, 33]     | [27, 31]     | [27, 31]
τ_E       | 5 ms          | [4.8, 5.2]   | [4.9, 5.3]   | [4.7, 5.1]   | [4.9, 5.3]
τ_I       | 5 ms          | [4.9, 5.2]   | [4.9, 5.3]   | [4.6, 5.1]   | [4.9, 5.3]
E_E       | 0 mV          | [−2, 2]      | [−2, 4]      | ?            | ?
E_I       | −70 mV        | [−71, −67]   | [−73, −68]   | [−72, −67]   | [−71, −68]
gain_E    | 7 nS          | [6.9, 7.3]   | [6.9, 7.3]   | [6.8, 7.3]   | [6.7, 7.2]
gain_I    | 7 nS          | [6.8, 7.3]   | [6.8, 7.4]   | [6.8, 7.3]   | [6.7, 7.2]

When V of a neuron is above 0 mV, V is reduced to V_r, while w changes
to w + b, and each neuron to which this neuron connects receives a spike. If
the connection is excitatory (inhibitory), g_E (g_I) in such a postsynaptic neuron
is increased by the weight of the connection multiplied by the synaptic gain.
Encoding of SNNs in our model has been described previously [22-24]. In order 
to recognise a subsequence of three signals in a random input stream, the network 
has three input nodes (one for each signal), either three or four interneurons, and 
a single output neuron. Dale's rule [4] is not enforced: a neuron can be both excitatory
and inhibitory at the same time. Furthermore, input nodes cannot connect to the
output neuron directly. Only interneurons can have self-loops. The settings for 
the artificial evolution in this work are as in our previous work [23], with three 
modifications: (i) the size of duplication of genetic elements was drawn from 
a geometric distribution with mean 6 (it was 11 previously), (ii) the elements 
coding for input and output were excluded both from duplications/deletions 
and crossover (they were allowed to undergo crossover in [23]), (iii) finally and 
most importantly, we modified slightly the way the fitness function is calculated, 
resulting in the procedure as follows. 

During evolution, each individual was evaluated on six input streams with
500 signals, each signal 6 ms in duration and followed by 16 ms silence (each
input stream thus lasted for 11 s). In four input streams, all signals (A, B and
C) occurred with equal probability; two input streams were constructed by
concatenating four triplets (with equal probability of occurrence): ABC and ABA,
ABB, BBC (three triplets that our preliminary work showed to be the most
problematic to distinguish from the pattern to be recognised, ABC). To calculate the
fitness function, we calculated R (for reward), the number of 22 ms intervals
(signal plus silence) of the last C of each ABC in the input sequence during
which the output neuron actually spiked, correctly, at least once, divided by the
total number of intervals in the input stream for which it should spike. In other
words, R is the true positive rate (TPR) of the network. We also calculated
P (for penalty), the number of other 22 ms intervals (signal plus silence) with
spikes on output (wrongly), divided by the total number of 22 ms intervals in
the input stream in which spikes should not occur. In contrast, false discovery
rate (FDR) of the network has the same numerator as P, but the denominator
is all the 22 ms intervals in which the spikes of the output neuron were observed.
The fitness function we used,


fitness = 1 - R + 4P    (5)


strongly penalises spikes that do not follow the target pattern. The constant 4 in
the penalty term was chosen by a preliminary exploration of values with the
objective to find a value that gave the highest yield of successful evolutionary
runs. We define a successful run as one that ends with a champion that is a
perfect recogniser. A perfect recogniser evolved without noise is a network that
spikes only after the correct pattern. For networks evolved with noise, we consider
an SNN a perfect recogniser if it has TPR > 0.99 and FDR < 0.01.
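
The quantities R, P and FDR defined above can be computed per 22 ms interval as in the following Python sketch. The representation of the input stream as a string of letters and of the output as one boolean per interval is our assumption for illustration, as is the convention of returning 0 when a denominator is empty.

```python
def target_intervals(stream):
    """True for every interval holding the last C of an ABC subsequence."""
    return [i >= 2 and stream[i - 2:i + 1] == "ABC" for i in range(len(stream))]

def rates(should_spike, spiked):
    """Return (R, P, FDR) from per-interval booleans of equal length."""
    tp = sum(1 for t, s in zip(should_spike, spiked) if t and s)
    fp = sum(1 for t, s in zip(should_spike, spiked) if not t and s)
    n_pos = sum(should_spike)
    n_neg = len(should_spike) - n_pos
    R = tp / n_pos if n_pos else 0.0          # true positive rate
    P = fp / n_neg if n_neg else 0.0          # penalty term
    fdr = fp / (tp + fp) if tp + fp else 0.0  # false discovery rate
    return R, P, fdr

def fitness(R, P):
    """Fitness as in Eq. 5 of this section."""
    return 1.0 - R + 4.0 * P
```

For the stream "ABCABCAB" with output spikes only in the third and fifth intervals, one of the two targets is detected and one of the six remaining intervals contains a wrong spike, so R = 0.5, P = 1/6 and FDR = 0.5.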

The slight modifications of the settings of the artificial evolution (from the 
ones used in [23]) had a quite pronounced effect on the yield of perfect recognisers 
when no noise was present (for three interneurons, 81% of runs versus 33% for 
the settings in [23]). However, the effect on the evolvability in the presence of 
noise was less pronounced. 

For each champion, we obtained the ranges of parameters for which it was 
robust using the following algorithm. We repeatedly extended the ranges of all 
parameters around their default values by a small value (specific to each 
parameter), at first in both directions. We then drew 100 random sets of parameters 
from the extended ranges, gave the same parameters to all neurons in the 
network, and checked whether at least 90 of these 100 SNNs had TPR > 0.90 and 
FDR < 0.10 (each network was tested on one random, and thus different, input 
stream with 50000 signals, with equal probability of occurrence for A, B, and C). 
If so, the extended ranges were kept. If not, the ranges were shrunk back to the 
previous sizes and the problematic parameter was identified (by excluding the 
parameters, one by one and in one of the two directions, from the set of 
parameters for which the ranges can be extended, and checking whether this allowed 
the range to be extended while keeping TPR > 0.90 and FDR < 0.10). The algorithm stopped 
when the set of parameters for which the range could be extended became empty. 
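
The range-expansion procedure above can be sketched as follows. This is only one possible reading of the description: the callback `is_robust` (standing in for "at least 90 of 100 SNNs reach TPR > 0.90 and FDR < 0.10"), the uniform sampling, the `max_rounds` guard and the fallback used when no single exclusion helps are all assumptions introduced for illustration.

```python
import random


def expand_ranges(defaults, steps, is_robust, n_sets=100, max_rounds=10000, rng=None):
    """Grow per-parameter ranges around `defaults` while `is_robust` accepts them.

    defaults, steps: dicts mapping parameter name -> default value / step size.
    is_robust(param_sets): hypothetical callback; True if enough of the sampled
        parameter sets give a robust network.
    """
    rng = rng or random.Random(0)
    lo, hi = dict(defaults), dict(defaults)
    # (name, -1) means "may still extend downwards", (name, +1) upwards.
    extendable = {(name, d) for name in defaults for d in (-1, 1)}

    def extended(active):
        new_lo, new_hi = dict(lo), dict(hi)
        for name, direction in active:
            if direction < 0:
                new_lo[name] -= steps[name]
            else:
                new_hi[name] += steps[name]
        return new_lo, new_hi

    def passes(new_lo, new_hi):
        sets = [{n: rng.uniform(new_lo[n], new_hi[n]) for n in defaults}
                for _ in range(n_sets)]
        return is_robust(sets)

    for _ in range(max_rounds):
        if not extendable:
            break
        new_lo, new_hi = extended(extendable)
        if passes(new_lo, new_hi):
            lo, hi = new_lo, new_hi          # keep the extended ranges
            continue
        # Extension failed: ranges stay at their previous size. Look for one
        # (parameter, direction) whose exclusion lets the extension pass.
        culprit = None
        for cand in sorted(extendable):
            if passes(*extended(extendable - {cand})):
                culprit = cand
                break
        # Assumption: if no single exclusion helps, drop an arbitrary candidate.
        extendable.discard(culprit if culprit is not None else min(extendable))
    return lo, hi
```

Because every check draws fresh random parameter sets, two runs with different seeds may return slightly different ranges; the result is an approximation of the robust range rather than an exact boundary.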

The sizes of the ranges (maximum minus minimum value) were compared 
for the networks evolved with the limit of three versus four interneurons using 
the James test implemented in the package Rfast of the R project 
(https://cran.r-project.org/). Proportions were compared using the function prop.test in R. 
3 Results and Discussion 


In 100 independent runs of 3000 generations each, when we allowed for three 
interneurons, 13 runs ended with perfect recognisers. When we allowed for four 
interneurons, 19 runs out of 100 resulted in perfect recognisers. Our previous 
work [23] suggested that at least three interneurons are needed to obtain perfect 
recognisers in the presence of noise; here, too, we were unable to evolve them with 
noise when fewer than three interneurons were allowed, and none of the runs 
in which the limit was set to three resulted in a champion with fewer. In contrast, 
two champions out of the 19 obtained when the limit was set to four interneurons 
ended up having three interneurons. 

The size of the ranges of robustness for 13 networks evolved with the limit of 
three versus 17 networks with four interneurons was not significantly different. 
We then tested how robust the networks were when each neuron in the network 
was given a different set of parameters drawn from the obtained range (during 
the range expansion algorithm, all neurons always had the same parameters 
drawn from the range; in this test, as during expansion, we made 100 
evaluations, each on a different random input stream with 50000 signals). None of 
the networks remained perfect recognisers, but some were quite robust to such a 
disruption (Table 2): notably champion 3 evolved with three interneurons 
(champion 3/3), and also champions 8/3 and 5/3; among the networks 
with four interneurons, champions 1/4 and 12/4. 

We have previously proposed a way to map the network activity to the states 
of finite state transducers (FST) [23,24]. Before we did such a mapping for the 
networks obtained here, we first analysed which networks could maintain their 
state for a very long time (in practice, noise may prevent a given network from 
maintaining the states infinitely). Nine out of 13 networks evolved with three 
interneurons sustained elongation of intervals between signals from 16 ms to at 
least 100 ms (Table 2; we assume that if the silence can be extended to 100 ms, 
the network maintains its state). Only four out of 17 with four interneurons did 
so (Table 2). Thus the fraction of perfect recognisers maintaining their state is 


Table 2. Robustness of 13 networks evolved limiting the number of interneurons to 
three (top) and 19 networks evolved limiting the number of interneurons to four 
(bottom; champions with labels in bold evolved to have 3 interneurons), when sampling 
the neuronal parameters from the ranges of robustness specific for each champion, and 
their robustness to increased interval of silence between signals. 


                                     0/3   1/3   2/3   3/3   4/3   5/3   6/3   7/3   8/3   9/3  10/3  11/3  12/3 
TPR>0.99 & FDR<0.01                   37    37    27    80    19    53     4    52    67    14    24    16     8 
TPR>0.95 & FDR<0.05                   71    79    75    97    68     9    93    98    93    80    50    51    79 
TPR>0.90 & FDR<0.10                   84    86    86   100    85     9    97    99    99    91    71    65    93 
Maximum interval of silence (ms)    >100    35  >100    28  >100  >100    48  >100  >100  >100  >100  >100    19 


                                     0/4   1/4   2/4   3/4   4/4   5/4   6/4   7/4   8/4   9/4  10/4  11/4  12/4  13/4  14/4  15/4  16/4  17/4  18/4 
TPR>0.99 & FDR<0.01                    5    66     1    39    39    18    58    39    11    34    17    38    64    17     5    24    29     T    48 
TPR>0.95 & FDR<0.05                   64    88    70    69    75    75    90    86    58    89    67    74    92    76    47    65    70    72    96 
TPR>0.90 & FDR<0.10                   90    94    92    85    86     9    93    98    83    95    86    83    98    93    73    81    82    90    98 
Maximum interval of silence (ms)      17    20    24    21    36  >100  >100  >100    19    27    18    18    29  >100    23    50    18  >100    28 


Fig. 1. Minimal FST for recognizing ABC. The nodes represent the states and edges 
represent the transitions from one state to another state on receiving an input symbol 
{A, B, C} and producing an output {0: no spike(s), 1: spike(s) of the output neuron}. 


significantly larger for networks with three interneurons (p = 0.017; one-sided 
test). The reason for this might be that in networks with four interneurons 
the additional neuron acts as one more source of noise disrupting the memory 
maintained as self-sustained high-frequency spiking (see below). 
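
As a plausibility check of the reported p-value, the comparison of 9 out of 13 versus 4 out of 17 state-maintaining networks can be recomputed. The sketch below assumes a one-sided two-proportion test with Yates' continuity correction, which is what prop.test applies by default for two groups; the function name is hypothetical, and the exact call used in this work is not stated.

```python
import math


def one_sided_prop_test(x1, n1, x2, n2):
    """P-value for the alternative x1/n1 > x2/n2 (normal approximation,
    pooled variance, Yates' continuity correction)."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = x1 / n1 - x2 / n2
    # The correction shrinks the difference towards zero, but never past zero.
    correction = min(0.5 * (1 / n1 + 1 / n2), abs(diff))
    z = math.copysign(abs(diff) - correction, diff) / se
    return 0.5 * math.erfc(z / math.sqrt(2))     # upper tail of N(0, 1)


# Three interneurons: 9 of 13 maintain state; four interneurons: 4 of 17.
p_value = one_sided_prop_test(9, 13, 4, 17)
```

Under these assumptions the value comes out at roughly 0.0165, in line with the reported p = 0.017.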

We considered a network the most robust if it had the highest number of sets 
of parameter values, among 100 sets independently sampled from the robustness 
ranges (such as shown in Table 1), that gave TPR > 0.99 and FDR < 0.01. 
Interestingly, the most robust networks (3/3 and 1/4) failed to maintain their 
state. For mapping the network states onto the states of an FST, we therefore 
chose networks 8/3 and 7/4, the most robust of the networks that maintain 
memory (Figs. 2 and 3). 

There are four states in a minimal-size FST that recognises a pattern that 
consists of three different signals in a specific order in a stream of three signals 
(Fig. 1). In both networks (8/3 and 7/4) the state of the network after they 
receive ABC (state hABC, for had ABC) is reached after a transition from a 
state in which all interneurons have zero or zero/low activity (neural states Z or 
L, respectively; Tables 3 and 4). The same was the case for all the other perfect 
recognisers obtained in this work (not shown). This means that the output in 
each network will spike if the input stream consists of a single signal, C. Since 
we are interested here in recognition in a continuous stream of signals, we do not 
consider it a serious issue. Perhaps, however, introducing a strong penalty for 
output spikes after the initial C would allow us to obtain networks with different 
structure and activity; we plan to investigate this in our future work. 
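
For reference, a four-state transducer with the state names used here (S, hA, hAB, hABC) can be written as a small lookup table. The transition table below is an assumption reconstructed from the state names and the task description, since the figure itself is not reproduced in the text; `TRANSITIONS` and `run_fst` are hypothetical names.

```python
# state -> input symbol -> (next state, output); output 1 means "output neuron spikes".
TRANSITIONS = {
    "S":    {"A": ("hA", 0), "B": ("S", 0),   "C": ("S", 0)},
    "hA":   {"A": ("hA", 0), "B": ("hAB", 0), "C": ("S", 0)},
    "hAB":  {"A": ("hA", 0), "B": ("S", 0),   "C": ("hABC", 1)},
    "hABC": {"A": ("hA", 0), "B": ("S", 0),   "C": ("S", 0)},
}


def run_fst(stream, state="S"):
    """Feed a string of symbols A/B/C through the transducer; return the outputs."""
    outputs = []
    for symbol in stream:
        state, out = TRANSITIONS[state][symbol]
        outputs.append(out)
    return outputs
```

With this table, `run_fst("ABCABABC")` produces a 1 only at the two positions where a C completes an ABC. A lone initial C produces 0, which is where the evolved networks described above differ from the transducer.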

The interneurons of 8/3 are fully connected (Fig. 2), and all the interneurons 
have excitatory self-loops. However, full connectivity with self-loops for 
interneurons in networks evolved for three interneurons is neither a sufficient 
nor a necessary condition for state maintenance (for example, 6/3 and 12/3 
have such a topology, but do not maintain the state, while 11/3 does so without 
full connectivity). 

Returning to 8/3: both interneurons N1 and N2 excite themselves 
strongly, and high-frequency spiking (H state) of N1 and N2 is observed in all states 
but hAB (which is maintained trivially, since all neurons are inactive). When signal 
Fig. 2. The topology and activity of network 8/3. 


Table 3. States of the neurons in network 8/3 in network states mapped on the states 
of the minimal FST. Z: zero, L: (zero or) low, H: high-spiking activity. See text for 
further details. 


          | S                  | hA                 | hAB             | hABC 
Neuron 0  | L: 0, 2, 3 spikes  | Z                  | Z               | L: 3 spikes 
Neuron 1  | H: 332 ± 1 Hz      | L: 0, 1, 2 spikes  | Z               | H: 331 ± 1 Hz 
Neuron 2  | H: 333 Hz          | H: 334 ± 1 Hz      | L: 1, 2 spikes  | H: 329 Hz 
Output    | Z                  | Z                  | Z               | L: 1, 2 spikes 


A is received, the strong connection from input A to N2 puts N2 in the H state, and 
because of strongly inhibitory connections from both input A and N2 to N1, N1 
is in an L state in the network state hA. The activity of input B strongly inhibits 
N2; this is why the transition from network state hA to hAB corresponds to L 
or Z states of all interneurons. When a network in such a state receives a C, the 
excitatory connection from input C to N0 and N0's weak self-excitation combine 
to make N0 spike exactly three times, which is necessary for the output to spike 
once or twice (output can be excited only by N0); connections from N0 to N1 
and from N1 to N2 are mainly responsible for putting both N1 and N2 in an 
H state. When, however, C is received in any other state, either N2 (state hA) 
or both N1 and N2 (states S and hABC) are in state H; their strong inhibitory 
connections to output prevent output from spiking (Fig. 2). 

Limitations of space prohibit us from providing a similar analysis for 7/4. 
We do, however, provide the data (Fig. 3, Table 4) sufficient for making it. 

Our preliminary analysis of the variability of the ways in which computation 
in this task is accomplished in networks that show state maintenance indicates 
that networks evolved with three interneurons belong to four distinct classes 
Fig. 3. The topology and activity of network 7/4. 


Table 4. States of the neurons in network 7/4 in network states mapped on the states 
of the minimal FST. Z: zero, L: (zero or) low, H: high-spiking activity. See text for 
further details. 


          | S                  | hA                 | hAB         | hABC 
Neuron 0  | Z                  | L: 2, 3 spikes     | Z           | Z 
Neuron 1  | H: 330 ± 3 Hz      | H: 280 ± 3 Hz      | L: 1 spike  | H: 330 Hz 
Neuron 2  | H: 332 ± 2 Hz      | L: 0, 1, 3 spikes  | Z           | H: 333 Hz 
Neuron 3  | L: 0, 4 spikes     | Z                  | Z           | L: 4 spikes 
Output    | Z                  | Z                  | Z           | L: 1, 2 spikes 


based on the assignment of neural states to network states. For network 8/3 
we can encode this assignment as (S, hA, hAB, hABC) =(LHH, ZLH, ZZL, 
LHH), where Z means zero activity, L means zero or low activity (a few spikes at 
most), and H means high-frequency spiking. The order of symbols in each triplet 
assigned to a state follows the order of interneurons’ labels (Table 3). Three other 
networks belong to this class, 0/3, 9/3, and 11/3 (such matching requires, of 
course, appropriate ordering of interneurons in each network). The other three 
possible classes are: (i) 4/3 and 7/3 have (ZHH, ZHL, ZLZ, LHH), (ii) 2/3 and 
5/3 have (HHH, HLZ, LZZ, HHH), and (iii) 10/3 has (HHH, LHH, ZLL, HHH). 
The four networks that show state maintenance with four interneurons all belong 
to different classes based on such an assignment: (i) 7/4 has (ZHHL, 
LHLZ, ZLZZ, ZHHL) (Table 4), (ii) 5/4 has (HZHH, HZHZ, LZHZ, HLHH), 
(iii) 13/4 has (HHHH, HLLH, LZZL, HHHH), and (iv) 17/4 has (HLLH, LZZH, 
ZZZL, HLLH). In our future work, we plan to further analyse the relationship 
between these classes and the network topologies, considering the signs and 
weights of the connections. 
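
Since matching two networks to the same class requires an appropriate ordering of their interneurons, the comparison can be made independent of neuron labels. The sketch below is one way to do this, assuming that two assignments belong to the same class exactly when they coincide after some relabelling of interneurons; `canonical_class` is a hypothetical helper.

```python
def canonical_class(assignment):
    """Label-independent form of an (S, hA, hAB, hABC) assignment.

    assignment: four strings, one letter (Z, L or H) per interneuron,
    e.g. ("LHH", "ZLH", "ZZL", "LHH") for network 8/3.
    """
    n_neurons = len(assignment[0])
    # One string per interneuron: its state in S, hA, hAB and hABC.
    per_neuron = ["".join(state[i] for state in assignment) for i in range(n_neurons)]
    # Sorting removes the dependence on how the interneurons are numbered.
    return tuple(sorted(per_neuron))


three_interneuron_classes = [
    ("LHH", "ZLH", "ZZL", "LHH"),   # 8/3, 0/3, 9/3, 11/3
    ("ZHH", "ZHL", "ZLZ", "LHH"),   # 4/3, 7/3
    ("HHH", "HLZ", "LZZ", "HHH"),   # 2/3, 5/3
    ("HHH", "LHH", "ZLL", "HHH"),   # 10/3
]
n_distinct = len({canonical_class(a) for a in three_interneuron_classes})
```

Under this assumption the four encodings listed for three interneurons remain distinct, while swapping two interneuron labels in any of them leaves the canonical form unchanged.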
4 Conclusions and Future Work 


We show that SNNs evolved to perform a simple but not trivial computational 
task in the presence of noise on neuronal membrane potential are robust to sam- 
pling all neuronal parameters from a certain range, and provide a procedure 
to approximate this range. Not surprisingly, we show that the range for vary- 
ing all parameters is narrower than for varying a single parameter each time 
(as we did previously [23]). In future work, we plan to further fine-tune this 
methodology, for example by giving all neurons different parameters during 
this procedure and by considering the dependence relationships between 
parameters (we have observed, for example, that increasing the value of one parameter 
may allow increasing the value of another). 

Setting a limit for the number of interneurons one higher than necessary to 
accomplish the task increased the yield of successful evolutionary runs (i.e., the 
evolvability), but resulted in a smaller fraction of networks that could maintain 
their state in the successful runs. Furthermore, there was no significant impact 
on the range of robustness to changes of parameters between slightly smaller 
and larger networks. In future work, we plan to investigate if larger networks 
will allow obtaining solutions in the presence of higher levels of noise. We would 
also like to see if other models of noise (such as an Ornstein-Uhlenbeck process, 
commonly used in computational neuroscience) impact evolvability and robust- 
ness. Another possible direction for future work is to investigate the evolution 
of recognition of longer patterns in the presence of noise. 

In this work, we performed a preliminary analysis of how the networks 
accomplish the temporal pattern recognition with state maintenance by assigning 
neural states to network states corresponding to the states of an FST. We show that 
the solutions belong to different classes, and thus different topologies can allow 
solving this task. In future work, we will analyse in more detail the variety of 
solutions obtained in independent runs. We would also like to see if changing 
the spiking behavior of neurons during evolution (e.g., to bursting) or the model 
itself (e.g., to leaky integrate and fire) leads to other classes of solutions. 
Abstract. Toxicology studies raise several concerns, which underline 
the importance of detecting early the potential toxicity of 
chemical compounds. This potential is currently evaluated through in vitro assays 
assessing their bioactivity, or using costly and ethically questionable in 
vivo tests on animals. We therefore investigate the prediction of the 
bioactivity of chemical compounds from their physico-chemical structure, and 
propose that it be automated using machine learning (ML) techniques 
based on data from in vitro assessment of several hundred chemical 
compounds. We provide the results of tests with this approach using several 
ML techniques, on both a restricted dataset and a larger one. Since 
the available empirical data is unbalanced, we also use data augmentation 
techniques to improve the classification accuracy, and present the 
resulting improvements. 


Keywords: Machine learning · Toxicity · QSAR · Data augmentation 


1 Introduction 


Highly regulated toxicology studies are mandatory for the marketing of chemi- 
cal compounds to ensure their safety for living organisms and the environment. 
The most important studies are performed in vivo in laboratory animals dur- 
ing different times of exposure (from some days to the whole life-time of the 
animal). Also, in order to rapidly get some indication of a compound’s effects, 
in vitro assays are performed using biological cell lines or molecules, to obtain 
hints about the bioactivity of chemicals, meaning their ability to affect biological 
processes. However, all of these studies raise ethical, economic and time 
concerns; indeed it would be ideal if the toxicity of chemical compounds could be 
assessed directly through physical, mathematical, computational and chemical 
means and processes. 

Therefore, in order to predict as early as possible the potential toxic effect of 
a chemical compound, we propose to use machine learning (ML) methods. The 
ambitious objective is to predict long term effects that will be observed in in vivo 
studies, directly from chemical structure. Nonetheless, this long term prediction 
seems to be difficult [24] because of the high level of biological variability and 
because toxicity can result from a long chain of causality. Therefore, in this paper 
we investigate whether taking into consideration the in vitro data can improve 
the quality of the prediction. In such a case the global objective of the long term 
toxicity prediction could be split into two parts: (i) first the prediction of in vitro 
bioactivity from chemical structure [27], and (ii) secondly the prediction of long 
term in vivo effects from in vitro bioactivity [23]. 

Here we focus on the first part (i) using ML approaches to determine a 
“quantitative structure-activity relationship” (QSAR) [17]. QSAR models aim 
at predicting any kind of compound activity based on physico-chemical 
properties and structural descriptors. Our purpose is to determine, using an ML 
approach, whether a compound’s physico-chemical properties can be used to predict 
whether the compound will be biologically active during in vitro assays. If 
ML could be shown to be effective in this respect, then it would serve to screen 
compounds and prioritize them for further in vivo studies. Then, in vivo toxicity 
studies would only be pursued with the smaller set of compounds that ML has 
indicated as being less bioactive, and which must then be certified via in vivo 
assessment. Thereby a significant step forward would be achieved, since animal 
experimentation could be reduced significantly with the help of a relevant 
ML-based computational approach. 

This paper is organized as follows. Section 2 details the data, algorithms 
and performance metrics used in this work. Section 3 presents the first results 
obtained on a subset of data. Section 4 shows the performance of an algorithm 
on the global dataset. Finally, we conclude in Sect. 5. 


2 Learning Procedure 


In this section we first describe the data used, then the ML algorithms that are 
tested and finally the metrics used to evaluate performances of the models. 


2.1 Data Description 


Since the long term objective aims at predicting in vivo toxicity, we need publicly 
available data for both in vivo and in vitro experimental results. The US 
Environmental Protection Agency (EPA) released this type of data in two different 
databases: (i) the ToxCast database contains bioactivity data obtained for around 
10,000 compounds tested in several hundred in vitro assays [7]; (ii) 
the Toxicity Reference database (ToxRefDB) gathers results from several types 
of in vivo toxicity studies performed for several hundred chemicals [20]. It 
is important to note that not all the compounds have been tested in all the 
assays from ToxCast and in each type of in vivo study present in ToxRefDB. 
Still guided by the long term objective, we consider a subset of these data 
including compounds for which both in vitro and in vivo results were available. 
The subset selection follows three steps. First, we look for the overlap of 
compounds present both in ToxCast and ToxRefDB and having results for in vivo 
studies performed in rats during two years. We obtain a matrix with 418 
compounds and 821 assays, with a lot of missing values. Secondly, we look for a large 
complete sub-matrix and we obtain a matrix of 404 compounds and 60 in vitro 
assays. Finally, in order to be sure to get a minimum of active compounds in 
the datasets, i.e. compounds for which an AC50 (half maximal activity 
concentration) could be measured, we remove assays with less than 5% of active 
compounds and obtain a final matrix of 404 compounds and 37 assays. 

For each of the 37 assays, we build a QSAR classification model to predict 
the bioactivity of a compound. These models use structural descriptors com- 
puted from the compound’s structure described in Structured Data Files. Two 
types of descriptors are used: (i) 74 physico-chemical properties (e.g. molecular 
weight, logP, etc.) which are continuous and normalized variables and (ii) 4870 
fingerprints which are binary vectors representing the presence or absence of a 
chemical sub-structure in a compound [21]. Fingerprints present in fewer 
than 5% of compounds are removed, leading to a final set of 731 fingerprints. 
Therefore, the obtained dataset is composed of 805 structural descriptors for the 
404 compounds. 
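
The fingerprint filtering step can be illustrated with a short sketch. It assumes the fingerprints are stored as rows of 0/1 values, one row per compound; `frequent_columns` is a hypothetical helper, not part of the pipeline used in this work.

```python
def frequent_columns(rows, min_fraction=0.05):
    """Return the indices of binary columns set in at least `min_fraction` of rows.

    rows: list of equal-length lists of 0/1 values (one list per compound).
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    kept = []
    for j in range(n_cols):
        present = sum(row[j] for row in rows)    # compounds containing sub-structure j
        if present / n_rows >= min_fraction:
            kept.append(j)
    return kept
```

Applied to the 4870 fingerprint columns of the 404 compounds, a filter of this kind is what leaves the 731 fingerprints that, together with the 74 physico-chemical properties, give the 805 descriptors.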

The property that we wish to predict is the activity in each in vitro assay in 
a binarised form. It is generally measured as an AC50 value, which is the dose of 
compound required to obtain 50% of activity in the assay. In the following, we 
consider that the binary version of the activity is 0 if the AC50 value equals 0 and 
1 otherwise. 


2.2 Learning Algorithms 


— The Random Neural Network (RNN) is a mathematical model of the 
spiking (impulse-like) probabilistic behaviour of biological neural systems [9, 
11] and it has been shown to be a universal approximator for continuous and 
bounded functions [10]. It has a compact computationally efficient “product 
form solution”, so that in steady-state the joint probability distribution of the 
states of the neurons in the network can be expressed as the product of the 
marginal probabilities for each neuron. The probability that any cell is excited 
satisfies a non-linear continuous function of the states of the other cells, and it 
depends on the firing rates of the other cells and the synaptic weights between 
cells. The RNN has been applied to many pattern analysis and classification 
tasks [6]. Gradient descent learning is often used for the RNN, but in this 
work we determine weights of the RNN using the cross-validation approach 
in [28]. 

- The Multi Layer RNN (MLRNN) uses the original simpler structure 
of the RNN and investigates the power of single cells for deep learning [25]. 
It achieves comparable or better classification at much lower computation 
cost than conventional deep learning methods in some applications. A 
cross-validation approach is used to determine the structure and the weights, and 
20 trials are conducted to average the results. The structure of the MLRNN 
used here is fixed as having 20 inputs and 100 intermediate nodes. 

— The Convolutional Neural Network (CNN) is a deep-learning tool [18] 
widely used in computer vision. Its weight-sharing procedure improves training 
speed with the stochastic gradient descent algorithm, and it has recently been 
applied to various types of data [15,26]. In this work, we use it with the following layers: 
“input-convolutional-convolutional-pooling-fully-connected-output” [5]. 

— Boosted Trees (called XGBoost in the sequel) is a popular tree ensemble 
method (as is Random Forest). The open-source software library XGBoost 
[4] provides an easy-to-use tool for implementing boosted trees with gradient 
boosting [8] and regression trees. 
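
To make the RNN description in the list above more concrete, the steady-state excitation probabilities can be obtained by fixed-point iteration. The sketch assumes the commonly cited form of the model's signal-flow equations, q_i = (Λ_i + Σ_j q_j w⁺_ji) / (r_i + λ_i + Σ_j q_j w⁻_ji); the symbols Λ (external excitatory rate), λ (external inhibitory rate), r (firing rate) and the function name are assumptions, as this notation is not given in the text.

```python
def rnn_excitation_probabilities(Lambda, lam, r, w_plus, w_minus, iters=200):
    """Fixed-point iteration for the excitation probabilities q_i.

    Lambda[i], lam[i]: external excitatory / inhibitory arrival rates of cell i.
    r[i]: firing rate of cell i (must be positive).
    w_plus[j][i], w_minus[j][i]: excitatory / inhibitory weight from cell j to cell i.
    """
    n = len(Lambda)
    q = [0.0] * n
    for _ in range(iters):
        new_q = []
        for i in range(n):
            excitation = Lambda[i] + sum(q[j] * w_plus[j][i] for j in range(n))
            inhibition = lam[i] + sum(q[j] * w_minus[j][i] for j in range(n))
            # Probabilities are capped at 1 (a saturated cell).
            new_q.append(min(1.0, excitation / (r[i] + inhibition)))
        q = new_q
    return q
```

Each q_i depends non-linearly on the q_j of the other cells through the weights, which is the dependence on the states of the other cells mentioned in the RNN item.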


2.3 Classification Settings and Performance Metrics 


For each of the 37 assays, we randomly subdivide the corresponding dataset D 
into a training set D_T and a testing set D_t. From D we randomly create 50 
instances of D_T and its complementary test set D_t so that for each instance, 
D = D_T ∪ D_t. Each of the ML techniques listed above is first trained on each 
D_T and then tested on D_t. The results we present below are therefore averages 
over the 50 randomly selected training and testing sets. Since the output of the 
datasets is either 0 or 1, this is a binary classification problem. 

Let TP, FP, TN and FN denote the number of true positives, false positives, 
true negatives and false negatives, respectively. Then the performance metrics 
that we use to evaluate the results are the Sensitivity (TP/(TP + FN)), the 
Specificity (TN/(TN + FP)) and the Balanced Accuracy, denoted for short 
BA ((Sensitivity + Specificity)/2). 
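
The three metrics can be computed directly from the label vectors. The sketch below follows the definitions above; the function name and the convention of returning 0 for an empty class are choices made for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, balanced accuracy) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, (sensitivity + specificity) / 2


# A classifier that labels every compound as inactive on a 10%-active dataset:
y_true = [1] * 10 + [0] * 90
always_negative = classification_metrics(y_true, [0] * 100)
```

For the always-negative classifier the Sensitivity is 0 and the Specificity is 1, so the BA is 0.5 even though 90% of the predictions are correct; this is why BA is more informative than plain accuracy on unbalanced data.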


3 Classification Results 


In the 37 datasets corresponding to the 37 assays, the ratio between positive 
and negative compounds varies between 5% and 30% with a mean around 12%. 
This highlights the unbalanced nature of the data, in favor of negative 
compounds. Here we test the ML algorithms on these unbalanced data and again 
after balancing them using data augmentation. 


3.1 Results on Unbalanced Datasets 


The MLRNN, RNN, CNN and XGBoost algorithms are used to classify the 
50 × 37 pairs of training and testing datasets, and the results are summarized in 
Fig. 1. Since these are unbalanced datasets, the BA may be a better metric to 
demonstrate the classification accuracy. In addition, misclassifying 
a positive as negative may be less desirable than misclassifying a negative 
as positive. Therefore, the metric of Sensitivity is also important. 

When looking at the BA obtained on the training data set (Fig. 1(a)), we 
observe that the RNN method is not good at learning from these unbalanced 
datasets, while the CNN, MLRNN and XGBoost techniques learn much better. 
Fig. 1. Training and testing mean-value results (Y-axis) versus different assays (X-axis) 
when the CNN, MLRNN, XGBoost, RNN are used for classification. 


Compared to the training accuracy, the performance on the testing dataset is 
more important since it demonstrates whether the model generalises accurately 
when classifying previously unseen chemical compounds. The testing 
results are presented in Figs. 1(d) to (f). Here, we see that the RNN performs the 
worst in identifying true positives (Sensitivity) and tends to classify most unseen 
chemical compounds as inactive, except for some assays. This can be explained by 
the number of inactive compounds being much larger than the number of active 
compounds in the training dataset. The CNN, MLRNN and XGBoost perform 
a bit better in identifying the TPs, and the MLRNN performs the best. But 
Sensitivity is still low and depends strongly on the assays and probably on the 
balance between active and inactive compounds in the corresponding datasets. 

Among all assays, the highest testing BA achieved by these classification 
tools is 68.50% attained by the CNN for assay number 4, with the corresponding 
Sensitivity being 47.10%. Among all assays, the highest testing Sensitivity is 
47.75% (MLRNN for assay 17) with a corresponding BA of 60.80%. 


3.2 Results on Balanced Datasets 


From the previous results, it appears that most of the classification techniques 
used are not good at learning from unbalanced datasets. Therefore, we try balancing 
the 50 × 37 training datasets with data augmentation, while the corresponding 
testing datasets remain unchanged. 

Here, the CNN, MLRNN, RNN and XGBoost are used to learn from the 
50 × 37 datasets, which are augmented for balanced training using the SMOTE 
method [3] as implemented in the Python toolbox imbalanced-learn [19]. The 
resulting Sensitivity, Specificity and BA are summarised in Fig. 2. 
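
The core idea of SMOTE, creating synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines. This is a simplified illustration and not the toolbox implementation used for the experiments; the function name, the default k and the seeding are assumptions.

```python
import random


def smote_like_oversample(minority, n_new, k=5, rng=None):
    """Create `n_new` synthetic points from a list of minority-class vectors.

    Requires at least two minority samples. Each synthetic point lies on the
    segment between a random minority sample and one of its k nearest
    minority neighbours.
    """
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        # Sort the remaining minority samples by squared Euclidean distance.
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()                       # position along the segment
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic
```

To balance a training set, `n_new` would be set to the difference between the numbers of inactive and active compounds, and only the training set would be augmented, leaving the testing set unchanged as described above.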
Fig. 2. Training and testing mean-value results (Y-axis) versus different assays (X-axis) 
on balanced datasets. 


Comparing the training balanced accuracies given in Figs. 1(a) and 2(a) 
shows that all the classification techniques we have 
discussed are capable of learning the training datasets after data augmentation. 
The training BA of the RNN method is still the lowest, but its testing BA is 
the highest for most of the assays. 

Among all assays, the highest testing BA is 68.88%, which is obtained with 
the RNN for assay 17; the corresponding testing Sensitivity is 66%, 
which is also the highest testing Sensitivity observed. Note that these values 
are higher than those reported in Fig. 1. 

Finally, for a better illustration, Fig. 3 compares the highest testing results 
obtained among all classification tools for classifying the datasets before and after 
data augmentation. This figure highlights the clear improvement of Sensitivity 
for all assays, which also leads to a better BA for most of them. Not 
surprisingly, Specificity is decreased after data augmentation since the proportion of 
negatives in the balanced training sets is much lower compared to the original 
ones. Therefore, the models no longer predict almost everything as negative, as they 
did before data augmentation. 
Fig. 3. Comparison between the highest testing results (Y-axis) versus different assay 
index (X-axis) on both unbalanced and balanced datasets. 


4 Classification Results on Extended Datasets 


4.1 New Datasets and Learning Procedure 


In this section we use a bigger dataset of 8318 compounds to classify the same 37 
assays. This 8318 x 37 matrix is not complete since not all the compounds were 
tested in all the assays. Thus, for each of the 37 assays, we build a classification 
model based on the compounds which were actually tested in the assay, leading to 
different datasets for each assay. Note that, as previously, the instance numbers 
of the two classes are very unbalanced. 

Compared to the previous datasets, all the generated fingerprints are included 
in the global dataset which corresponds to 4870 fingerprints in total (added to 
the 74 molecular descriptors previously described). Nonetheless, for each of the 
37 assays and before the learning, a descriptor selection is performed in 
two steps: (i) descriptors having a variance close to 0 (and thus 
not sufficiently informative) are removed; (ii) a Fisher test is computed between 
each descriptor and the output assay, and descriptors are ranked according to 
the obtained p-value; we keep the 20% best descriptors. 
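
The two selection steps can be sketched for binary descriptors such as fingerprints. The sketch assumes that the Fisher test is a two-sided Fisher exact test on the 2×2 table of descriptor value versus activity; how continuous descriptors were tested is not stated, so they are not covered. The function names and the variance threshold are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):                                  # hypergeometric probability
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    observed = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= observed * (1 + 1e-9))


def select_descriptors(columns, labels, keep_fraction=0.2, min_variance=1e-8):
    """columns: dict name -> list of 0/1 values; labels: list of 0/1 activities."""
    scored = []
    for name, values in columns.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance <= min_variance:
            continue                              # step (i): near-constant descriptor
        pairs = list(zip(values, labels))
        a = sum(1 for v, y in pairs if v == 1 and y == 1)
        b = sum(1 for v, y in pairs if v == 1 and y == 0)
        c = sum(1 for v, y in pairs if v == 0 and y == 1)
        d = sum(1 for v, y in pairs if v == 0 and y == 0)
        scored.append((fisher_exact_two_sided(a, b, c, d), name))
    scored.sort()                                 # step (ii): smallest p-value first
    n_keep = max(1, round(keep_fraction * len(scored)))
    return [name for _, name in scored[:n_keep]]
```

Because the ranking uses the assay labels, a selection of this kind would normally be recomputed for each of the 37 assays, as described above.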

A Random Forest (RF) classifier, an ensemble technique that combines many 
decision trees built using random subsets of training examples and features [2], 
is used for the learning because it has the advantage of dealing with a large number 
of features without overfitting. A 10-fold cross-validation is performed 10 times 
and the average Sensitivity, Specificity and BA are computed to evaluate the 
internal performance of the classifiers. As previously, we test the RF classifier 
on both unbalanced and balanced datasets. 


4.2 Results on Unbalanced Datasets 


Figure 4 presents the results obtained with the method described above applied 
to the datasets used in Sect. 3 as well as to the extended ones described in 
Sect. 4.1. We observe that, for both ensembles of datasets, the RF method is not 
good at identifying TPs (Sensitivity < 50%) and predicts almost all 
compounds as negatives (Specificity > 90%). However, we see that the extended 
datasets lead to higher performance for most of the assays. Among all, the highest 
BA achieved by the RF is 71.08% for assay 17, with corresponding Sensitivity 
and Specificity of 47.10% and 95.05% respectively. When looking at the 
distribution between active and inactive compounds in all assays, we see that 
assay 17 is the one which has the least unbalanced dataset, with 30% of actives 
in the initial dataset and 22% in the extended one. This could explain why 
this assay always leads to the best performance. Also, the percentage of active 
compounds for each assay in the extended dataset is always lower compared to 
the initial dataset (data not shown). Nevertheless, since the results are better 
with the extended dataset, it seems that the total number of observations has 
an impact on the results and not only the ratio between actives and inactives. 
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Fig. 4. Results of RF algorithm (Y-axis) versus different assays (X-axis). 
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Fig. 5. Results of the RF algorithm (Y-axis) versus different assays (X-axis) on bal- 
anced datasets. 
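
The three evaluation measures can be illustrated with a short sketch. We assume here that BA denotes the balanced accuracy, that is, the mean of Sensitivity and Specificity; the values reported for assay 17 are consistent with this reading. The function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, true negatives and false positives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp, fn, tn, fp

def sensitivity(tp, fn):
    return tp / (tp + fn)  # fraction of actives that are recovered

def specificity(tn, fp):
    return tn / (tn + fp)  # fraction of inactives that are recovered

def balanced_accuracy(sens, spec):
    return (sens + spec) / 2.0  # assumed definition of BA

# Toy example: 3 actives, 5 inactives.
tp, fn, tn, fp = confusion_counts([1, 1, 1, 0, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0, 1])
toy_ba = balanced_accuracy(sensitivity(tp, fn), specificity(tn, fp))

# Consistency check with the assay 17 figures quoted in the text (in percent).
ba_assay_17 = balanced_accuracy(47.10, 95.05)
```

Under this definition, a classifier that labels almost everything as inactive keeps a high Specificity but is penalized through its low Sensitivity.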


4.3 Results on Balanced Datasets 


Figure 5 presents the results obtained with the same protocol but with the data 
augmentation method SMOTE applied to each training dataset of the 
cross-validation. As in Sect. 3, we observe that for extended datasets, all the 
results are improved after data augmentation (Sensitivity is increased by 8% on 
average and BA by 3%). Still, the Sensitivity is low compared to the Specificity. 
Among all assays, the highest BA achieved by the RF on the extended dataset is 
73.64%, with corresponding Sensitivity and Specificity of 54.93% and 92.36%, 
respectively, again for assay 17. These results highlight that both the total 
number of compounds in the dataset and the ratio between active and inactive 
compounds have an impact on the performance of the models. Indeed, having a 
bigger dataset which is balanced improves performance. 
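
The idea of applying SMOTE to the training part of each fold only can be sketched as follows. This is a simplified illustration of the interpolation principle behind SMOTE, not the implementation used in the experiments: the function names, the neighbourhood size k = 5 and the toy data are hypothetical, and the active class is assumed to be the minority class with at least two members.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (squared Euclidean distance), x excluded
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

def balance_training_fold(X_train, y_train, positive=1, k=5, seed=0):
    """Oversample the positive class of a training fold until both classes match."""
    minority = [x for x, y in zip(X_train, y_train) if y == positive]
    n_majority = len(y_train) - len(minority)
    new_points = smote(minority, n_majority - len(minority), k=k, seed=seed)
    return X_train + new_points, y_train + [positive] * len(new_points)

X_fold = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6], [6, 6], [7, 7], [8, 8]]
y_fold = [1, 1, 1, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = balance_training_fold(X_fold, y_fold)
```

Only the training fold is augmented; the held-out fold keeps its original class distribution, so the reported metrics are not computed on synthetic compounds.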


5 Conclusion and Perspectives 


From the results presented here, we can draw several conclusions. First, the 
methods we have proposed can correctly predict bioactivity from the physico- 
chemical descriptors of compounds. However, some methods appear to be signif- 
icantly better than others. Also, this appears to depend strongly on the assays 
themselves and their corresponding datasets. Moreover, we showed that the use 
of a larger dataset improves the classification performance, even if the data is 
unbalanced. Furthermore, we see that data augmentation techniques can play 
an important role in classification performance for the unbalanced datasets. 

This work on ML applied to toxicology data raises further interesting issues. 
Since there is no absolute winner among the classification techniques that we 
have used, we may need to test other methods such as Support Vector Machines 
(SVM) [1] or Dense Random Neural Networks (DenseRNN) [14]. Also, it would 
be interesting to apply the algorithms used on the small dataset to the extended 
one and compare against the RF method. We may also test other data augmen- 
tation techniques to seek the most appropriate ones [16]. Furthermore, in order 
to assess the prediction accuracy of bioactivity for a new compound, it is impor- 
tant to know if this compound has a chemical structure that is similar to the 
ones used in the training set. For this, we could use the “applicability domain” 
approach [22] as a tool to define the chemical space of an ML model. Finally, if we 
refer to the long-term objective of this work, which is to link the molecular 
structure to in vivo toxicity, we could think about using the approach we have 
used as an intermediate step, and also train ML techniques to go from in vitro 
data to the prediction of in vivo effects. However, some preliminary tests that 
we have carried out (and not yet reported) reveal a poor correlation between 
in vitro and in vivo results, so that other data that are more directly correlated 
to toxicity could be considered in future ML predictive models of toxicity. In 
addition, we could consider combining the results obtained with several ML 
methods, similar to a Genetic Algorithm based combination [12,13], to enhance 
the prediction accuracy. 


Abstract. In this work, an energy-based clustering method is used to prune 
heterogeneous ensembles. Specifically, the classifiers are grouped according to 
their predictions on a set of validation instances that are independent from the ones 
used to build the ensemble. In the empirical evaluation carried out, the cluster that 
minimizes the error on the validation set, besides reducing computational costs 
for storage and the prediction times, is almost as accurate as the complete 
ensemble. Furthermore, it outperforms subensembles that summarize the 
complete ensemble by including representatives from each of the identified clusters. 


Keywords: Machine learning · Clustering analysis · Classifier ensembles 


1 Introduction 


In ensemble learning, the outputs of a collection of diverse predictors are combined to 
yield a global prediction that is expected to be more accurate than the individual ones. 
The key to obtaining accuracy improvements is that the predictors be complementary. 
This means that their errors should be independent, so that the mislabeling of an 
instance by a given classifier can be compensated in the combination process by correct 
predictions from other classifiers. A homogeneous ensemble is composed of predictors 
of the same type. Since the ensemble classifiers are trained on the same set of labeled 
data, diversification mechanisms are needed to generate predictors that are actually 
different (Dietterich 2000). To this end, instabilities of the learning algorithm that is 
used to build the individual ensemble members can be exploited. Heterogeneous 
ensembles are composed of classifiers of different types. In practical applications they 
have proven to be very effective: The aggregation of the predictions of classifiers of 
different types can be used to compensate their individual biases, which should be 
distinct. In spite of their practical advantages, heterogeneous ensembles have not been 
analyzed as extensively as their homogeneous counterparts. This analysis is the major 
novelty of this work. One reason for this gap in the literature is the difficulty of 
analyzing their aggregated prediction. Specifically, it is no longer possible to assume 
that the predictions of the classifiers on an individual instance are independent iden- 
tically distributed random variables (Lobato et al. 2012). 

The main drawback of ensemble methods is their high computational cost in terms 
of space and time: All the predictors need to be stored in memory. Furthermore, one 
needs to query every ensemble member to compute the final, aggregated prediction. In 
homogeneous ensembles, pruning techniques have been designed to identify subsets of 
classifiers whose predictive accuracy is equivalent to or, in some cases, better than 
that of the complete ensemble (Suárez et al. 2009). In this manner, both memory costs 
and prediction times are reduced, which could be a key advantage in real-time 
applications. In this work, we propose to analyze the problem of pruning 
heterogeneous ensembles using a novel perspective based on clustering techniques. 
Previously, clustering has been used to identify representatives that can be used to 
effectively summarize a complete ensemble (Bakker and Heskes 2003). For homogeneous 
ensembles, clustering can be made on the basis of the parameters of the models or 
based on the models’ outputs on a dataset, typically a validation or a test set 
independent of the data used for training. Given the disparate nature of the 
ensemble classifiers, in heterogeneous ensembles only the latter ensemble clustering 
technique can be applied. For the sake of completeness, we summarize the energy-based 
clustering algorithm introduced in (Bakker and Heskes 2003) in the following section. 


2 Ensemble Clustering Based on Model Outputs 


Let $D_{train} = \{(x_n^{train}, y_n^{train})\}_{n=1}^{N_{train}}$ be a set of labeled instances used to build the 
ensemble. The components of the vector $x_n^{train} \in X$ are the attributes of the nth instance 
in the training set. The value $y_n^{train} \in Y$ is the corresponding class label. An ensemble 
$\mathcal{H} = \{h_c\}_{c=1}^{C}$ is composed of C predictors. The cth predictor in the ensemble is a 
function $h_c : X \to Y$ that takes attribute vectors as inputs and yields a class label. 
Specifically, $h_c(x)$ is the prediction of the cth ensemble member on the instance 
characterized by the vector of attributes $x \in X$. The global ensemble prediction for this 
instance is given by an aggregation of the individual predictions, 
$\mathcal{H}(x) = \mathcal{A}[\{h_c(x)\}_{c=1}^{C}]$. In this work, the individual outputs of the ensemble 
predictors are aggregated using (unweighted) majority voting. 
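
As an illustration, unweighted majority voting over the ensemble members can be sketched as below. The three toy predictors and the function names are hypothetical; ties are resolved in favor of the label that is encountered first, which is a convention chosen for this sketch and is not stated in the text.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label; ties go to the label encountered first."""
    return Counter(labels).most_common(1)[0][0]

def ensemble_predict(predictors, x):
    """Aggregate the individual predictions h_c(x) by unweighted majority voting."""
    return majority_vote([h(x) for h in predictors])

# Three toy binary classifiers acting on a two-dimensional attribute vector.
predictors = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]
label = ensemble_predict(predictors, (0.9, 0.2))  # individual votes: 1, 0, 1
```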

The goal of clustering is to make groupings based on similarities among the 
outputs of the members of the ensemble on a set of validation instances, 
$h_c^{val} = \{h_c(x_n^{val})\}_{n=1}^{N_{val}} \in Y^{N_{val}}$. To avoid biases, the validation set should be independent 
of the training set. Since the class labels are not needed for clustering, the test set, if 
available in the training phase, can be used for clustering. The clusters are characterized 
by their centroids $\{m_k \in Y^{N_{val}}; k = 1, \ldots, K\}$. To identify the clusters one could use 
some standard algorithm, such as K-means (MacQueen 1967) or its fuzzy version 
(Bezdek et al. 1984). However, in our empirical investigation, the energy-based 
clustering method introduced in (Bakker and Heskes 2003) is more effective. In this 
procedure, one minimizes the free energy, which is the difference between an enthalpic 
and an entropic term 

$$(P^*, M^*) = \arg\min_{(P,M)} F(P, M) = \arg\min_{(P,M)} \left[ H(P, M) - T S(P) \right]. \qquad (1)$$

The free energy depends on the C × K matrices $M = \{m_k\}_{k=1}^{K}$ and $P = \{p_c\}_{c=1}^{C}$, 
where $p_c = \{p_{ck}\}_{k=1}^{K}$, and $p_{ck}$ is the probability that the classifier $h_c$ belongs to cluster 
k. By normalization, $\sum_{k=1}^{K} p_{ck} = 1$. The enthalpy is the average distance of the classifiers 
to the cluster centroids 

$$H(P, M) = \sum_{c=1}^{C} \sum_{k=1}^{K} p_{ck} D(h_c, m_k), \qquad (2)$$

where $D(h_c, m_k)$ is the distance between the cth classifier in the ensemble and centroid 
k. In principle, any distance function, such as the mean-square error or the 
cross-entropy error, can be used. The minimum of (2) is achieved when all the ensemble 
members are assigned to the nearest cluster; that is, predictor $h_c$ is assigned to cluster 

$$k^* = \arg\min_{k} D(h_c, m_k). \qquad (3)$$

The entropy is a measure of how sharply the clusters are defined 

$$S(P) = -\sum_{c=1}^{C} \sum_{k=1}^{K} p_{ck} \log p_{ck}. \qquad (4)$$

The term proportional to the entropy is included in the objective function to prevent 
the clustering algorithm from getting trapped in a local minimum. At the beginning of the 
search, in the absence of knowledge of the structure of the clusters, the temperature 
parameter takes a high value to favor exploration. As the algorithm proceeds, T is 
decreased according to a deterministic annealing schedule (Rose 1998). At a fixed 
temperature T > 0, and for fixed values of the cluster centroids $\{m_k\}_{k=1}^{K}$, the solution of 
the optimization problem (1) is of the softmax form 

$$p_{ck} = \frac{e^{-\beta D(h_c, m_k)}}{\sum_{k'=1}^{K} e^{-\beta D(h_c, m_{k'})}}, \qquad (5)$$

where $\beta = 1/T$ is the inverse temperature (Rose 1990; Buhmann and Kühnel 1993; 
Bakker and Heskes 2003). In the infinite temperature limit $\beta \to 0$, a given ensemble 
member is assigned to all clusters with equal probability. At low temperatures, only 
configurations around the minimum of (2) are explored. In the limit of zero temperature 
$\beta \to \infty$, the clusters become sharply defined according to (3). For each annealing epoch, 
the value of the temperature is fixed. The expectation-maximization algorithm is then 
used to find the optimum of the free energy. If the mean-squared error or the 
cross-entropy error is used as distance function, starting from an initial configuration of the 
probabilities $p_{ck}^{[0]}$, the update rule is 

$$m_k^{[t+1]} = \arg\min_{m_k} \sum_{c=1}^{C} p_{ck}^{[t]} D(h_c, m_k) = \frac{\sum_{c=1}^{C} p_{ck}^{[t]} h_c}{\sum_{c=1}^{C} p_{ck}^{[t]}}, \quad k = 1, \ldots, K \qquad (6)$$

$$p_{ck}^{[t+1]} = \frac{e^{-\beta D(h_c, m_k^{[t+1]})}}{\sum_{k'=1}^{K} e^{-\beta D(h_c, m_{k'}^{[t+1]})}}, \quad c = 1, \ldots, C; \; k = 1, \ldots, K \qquad (7)$$


Iterative updates of the maximization and the expectation steps, given by Eqs. (6) 
and (7), respectively, are made until convergence. While the cluster centroids and the 
probabilities have not converged, the inverse temperature is incremented according to 
the annealing schedule. Following the prescription given in (Bakker and Heskes 2003), 
initially $\beta = 1$. This value is incremented by 1 at each annealing epoch until the 
clusters become sufficiently sharp (the centroids have reached convergence and the 
clusters remain practically unalterable). 
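
The annealed expectation-maximization procedure can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the text: the distance is the mean-square error on labels encoded as 0 and 1, the sharpness criterion is that every classifier has a membership probability of at least 0.99, the inverse temperature is capped at 50, and every cluster keeps a nonzero total membership. All function and parameter names are hypothetical.

```python
import math

def mse(h, m):
    """Mean-square distance between a prediction vector and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(h, m)) / len(h)

def e_step(outputs, centroids, beta):
    """Softmax membership probabilities at inverse temperature beta."""
    P = []
    for h in outputs:
        d = [mse(h, m) for m in centroids]
        d_min = min(d)  # shifting by the minimum leaves the softmax unchanged
        w = [math.exp(-beta * (dk - d_min)) for dk in d]
        total = sum(w)
        P.append([wk / total for wk in w])
    return P

def m_step(outputs, P):
    """Centroids as probability-weighted means, which minimize the MSE distance."""
    C, K, n = len(outputs), len(P[0]), len(outputs[0])
    centroids = []
    for k in range(K):
        weight = sum(P[c][k] for c in range(C))  # assumed to be nonzero
        centroids.append([sum(P[c][k] * outputs[c][i] for c in range(C)) / weight
                          for i in range(n)])
    return centroids

def annealed_clustering(outputs, P, tol=1e-6, max_iter=100, sharp=0.99, max_beta=50):
    centroids = m_step(outputs, P)
    beta = 1.0  # initial inverse temperature
    while True:
        for _ in range(max_iter):  # EM iterations at fixed temperature
            P = e_step(outputs, centroids, beta)
            new_centroids = m_step(outputs, P)
            shift = max(abs(a - b)
                        for old, new in zip(centroids, new_centroids)
                        for a, b in zip(old, new))
            centroids = new_centroids
            if shift < tol:
                break
        if all(max(row) >= sharp for row in P) or beta >= max_beta:
            return P, centroids, beta
        beta += 1.0  # next annealing epoch

outputs = [[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]]
P0 = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]
P, centroids, beta = annealed_clustering(outputs, P0)
hard = [max(range(len(row)), key=row.__getitem__) for row in P]
```

In this toy setting the memberships stay soft while the inverse temperature is small and sharpen as it grows, at which point the hard assignments can be read off from the largest probability in each row of P.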


3 Empirical Evaluation 


The goal of ensemble pruning is to reduce the costs of storage and the time for the 
predictions without a significant loss (in some cases, with improvements) of accuracy. 
Clustering can be used to carry out this selection in different ways. For instance, the 
ensemble can be replaced by representatives from each of the identified clusters, as in 
(Bakker and Heskes 2003). In this work, we take a different approach and attempt to 
identify the most accurate cluster. To this end, we select the cluster $k^*$ that has the lowest 
error on the validation set, where the error of cluster k is that of the subensemble composed of the 
predictors assigned to cluster k. The accuracy of this subensemble is then evaluated on 
a test set that is independent of both the training and the validation set. 
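
The selection of the most accurate cluster can be sketched as below, assuming that the classifiers of each cluster are aggregated by majority voting and that the error is the fraction of misclassified validation instances. The function names and the toy data are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]

def subensemble_error(member_outputs, y_val):
    """Validation error of the majority vote of the given ensemble members."""
    errors = 0
    for i in range(len(y_val)):
        vote = majority_vote([out[i] for out in member_outputs])
        errors += vote != y_val[i]
    return errors / len(y_val)

def select_best_cluster(outputs, assignments, y_val, K):
    """Return (k*, error) for the cluster with the lowest validation error."""
    best_k, best_err = None, float("inf")
    for k in range(K):
        members = [out for out, a in zip(outputs, assignments) if a == k]
        if not members:  # empty clusters cannot be selected
            continue
        err = subensemble_error(members, y_val)
        if err < best_err:
            best_k, best_err = k, err
    return best_k, best_err

y_val = [0, 1, 1, 0]
outputs = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 1, 1, 1],   # members of cluster 0
           [1, 0, 0, 1], [1, 1, 0, 1], [0, 0, 1, 1]]   # members of cluster 1
assignments = [0, 0, 0, 1, 1, 1]
best_k, best_err = select_best_cluster(outputs, assignments, y_val, K=2)
```

Note that the clustering itself does not use the class labels, whereas this selection step does, so it requires labeled validation instances.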

The experiments have been carried out in 10 different classification problems from 
the UCI repository (Bache and Lichman 2017). For each classification problem, 1/3 of 
the labeled instances are set aside for testing. From the remaining 2/3, 80% are used for 
training and 20% for validation. Using the training data, 100 multilayer perceptrons 
(MLP) and 100 random trees (RT) are built using the Scikit-learn Python package [10]. 
Each of the classifiers in this heterogeneous ensemble is built on a bootstrap sample of 
the same size as the original training set, as in bagging (Breiman 1996). The random 
trees are built as in random forest (Breiman 2001), using the following settings: 
Random subsets whose size is the square root of the total number of attributes are 
considered for the splits at the inner nodes of the random trees. The split that minimizes 
the Gini impurity is selected. Splits are made until either the node is pure or it has only 
2 instances. Five different clusters are identified on the basis of the predictions of the 
ensemble classifiers on the validation instances using the algorithm described in the 
previous section. Similar accuracies (but different pruning rates) are obtained fixing the 
number of clusters to 2, 3 or 7. The best cluster is also selected using the validation 
set, which is independent from the one used for training. The results of the empirical 
evaluation performed are summarized in Table 1. The values displayed in the columns 
labeled $E_{test}$ are the test error rates averaged over 30 independent train/test 
partitions, followed by the standard deviation after the ± symbol. The errors reported 
in the second column correspond to the homogeneous bagging ensemble composed of the 
100 MLPs that have been built. The third column corresponds to a random forest 
composed of the 100 random trees generated. The fourth column corresponds to the 
heterogeneous ensemble that includes both the 100 MLPs and the 100 RTs. Finally, the 
composition of the optimal cluster ($k^*$) and the corresponding test error are 
displayed in the fifth and sixth columns, respectively. The size of the optimal cluster 
is $C_{k^*}$. The number of MLPs in this cluster is $C_{k^*}^{[MLP]}$. The number of 
RTs is $C_{k^*}^{[RT]}$. 


Table 1. Summary of the results of the empirical evaluation 


Ensemble                | MLP           | RF            | MLP + RF      | Best cluster k* 
                        | E_test        | E_test        | E_test        | C_k* (C_k*[MLP] + C_k*[RT]) | E_test 
Blood                   | 0.246 ± 0.026 | 0.263 ± 0.024 | 0.252 ± 0.018 | 42 (42 + 0)                 | 0.243 ± 0.026 
Breast cancer Wisconsin | 0.046 ± 0.011 | 0.062 ± 0.011 | 0.068 ± 0.011 | 43 (42 + 1)                 | 0.045 ± 0.009 
Cars                    | 0.123 ± 0.024 | 0.103 ± 0.015 | 0.118 ± 0.016 | 51 (7 + 44)                 | 0.118 ± 0.016 
Chess                   | 0.021 ± 0.006 | 0.025 ± 0.004 | 0.023 ± 0.005 | 45 (30 + 15)                | 0.021 ± 0.005 
Diabetes (Pima)         | 0.239 ± 0.022 | 0.246 ± 0.013 | 0.239 ± 0.009 | 35 (32 + 3)                 | 0.239 ± 0.021 
German                  | 0.275 ± 0.021 | 0.285 ± 0.017 | 0.269 ± 0.015 | 46 (42 + 4)                 | 0.271 ± 0.019 
Heart disease           | 0.472 ± 0.045 | 0.466 ± 0.040 | 0.450 ± 0.033 | 47 (2 + 45)                 | 0.452 ± 0.035 
Liver                   | 0.391 ± 0.043 | 0.382 ± 0.075 | 0.454 ± 0.062 | 37 (1 + 36)                 | 0.387 ± 0.077 
SPECT heart             | 0.350 ± 0.044 | 0.350 ± 0.031 | 0.336 ± 0.029 | 43 (34 + 9)                 | 0.350 ± 0.039 
Tic-tac-toe             | 0.203 ± 0.028 | 0.118 ± 0.024 | 0.132 ± 0.019 | 44 (3 + 2)                  | 0.118 ± 0.023 


From these results it is apparent that, in most of the problems analyzed, the 
accuracy of the selected cluster is comparable to the best among the three complete 
ensembles. Furthermore, one achieves a pruning rate of ~20%, which directly trans- 
lates into a five-fold reduction of storage needs and prediction times. These optimal 
clusters are fairly homogeneous: In six of the problems analyzed, they are composed mostly 
of MLPs; in the remaining four, random trees form a majority. 
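
The size of the reduction can be checked directly from the cluster sizes listed in Table 1. The sketch below only restates that arithmetic; the variable names are hypothetical, and the complete heterogeneous ensemble is taken to contain 200 classifiers (100 MLPs and 100 random trees).

```python
ENSEMBLE_SIZE = 200  # 100 MLPs + 100 random trees

# Sizes of the selected clusters, as listed in Table 1.
cluster_sizes = {
    "Blood": 42, "Breast cancer Wisconsin": 43, "Cars": 51, "Chess": 45,
    "Diabetes (Pima)": 35, "German": 46, "Heart disease": 47, "Liver": 37,
    "SPECT heart": 43, "Tic-tac-toe": 44,
}

# Fraction of the complete ensemble that is retained in each problem.
retained = {name: size / ENSEMBLE_SIZE for name, size in cluster_sizes.items()}
mean_retained = sum(retained.values()) / len(retained)
reduction_factor = 1.0 / mean_retained
```

On average slightly more than one fifth of the classifiers are retained, which corresponds to a reduction of storage and prediction time by a factor close to five.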

An interesting question is whether these pure ensembles are more accurate than 
ensembles that retain a single representative per cluster, as in (Bakker and Heskes 
2003). To provide a fairer comparison, we also consider the possibility of 
summarizing the ensemble by retaining multiple representatives per cluster so that the final 
subensemble has the same size as the selected cluster. The results of this comparison, 
which are presented in Table 2, show that, in fact, the increased diversity of the 
ensembles of representatives is detrimental and increases the test error. 

Table 2. Test error rates for clustering-based pruned ensembles 


Ensemble                | Single representatives | Multiple representatives | Best cluster k* 
Blood                   | 0.251 ± 0.023          | 0.253 ± 0.023            | 0.243 ± 0.026 
Breast cancer Wisconsin | 0.055 ± 0.015          | 0.056 ± 0.015            | 0.045 ± 0.009 
Cars                    | 0.123 ± 0.024          | 0.123 ± 0.024            | 0.118 ± 0.016 
Chess                   | 0.023 ± 0.004          | 0.023 ± 0.004            | 0.021 ± 0.005 
Diabetes (Pima)         | 0.0246 ± 0.017         | 0.249 ± 0.015            | 0.239 ± 0.021 
German                  | 0.277 ± 0.020          | 0.277 ± 0.020            | 0.271 ± 0.019 
Heart disease           | 0.472 ± 0.045          | 0.479 ± 0.044            | 0.452 ± 0.035 
Liver                   | 0.392 ± 0.059          | 0.394 ± 0.060            | 0.387 ± 0.077 
SPECT heart             | 0.347 ± 0.044          | 0.345 ± 0.038            | 0.350 ± 0.039 
Tic-tac-toe             | 0.162 ± 0.046          | 0.159 ± 0.048            | 0.118 ± 0.023 


In summary, we have applied an energy-based clustering method to identify a 
subensemble whose accuracy is comparable to the complete heterogeneous ensemble, 
which is composed of random trees and multilayer perceptrons. The selected 
subensemble is fairly homogeneous: it is either composed mainly of MLPs or mainly of 
RTs. Contrary to what could be expected, in this particular setting the reduction of 
diversity leads to improvements of accuracy. 


Abstract. In this paper, we propose a hand gesture recognition model 
based on superficial electromyographic signals. The model responds in 
approximately 29.38 ms (real time) with a recognition accuracy of 90.7%. 
We apply a sliding window approach using a main window and a sub- 
window. The sub-window is used to observe a segment of the signal seen 
through the main window. The model is composed of five blocks: data 
acquisition, preprocessing, feature extraction, classification and postpro- 
cessing. For data acquisition, we use the Myo Armband to measure the 
electromyographic signals. For preprocessing, we rectify, filter, and detect 
the muscle activity. For feature extraction, we generate a feature vector 
using the preprocessed signals values and the results from a bag of func- 
tions. For classification, we use a feedforward neural network to label 
every sub-window observation. Finally, for postprocessing we apply a 
simple majority voting to label the main window observation. 


Keywords: Artificial Neural Networks · Electromyography · 
Hand gesture recognition · Machine learning · Signal processing 


1 Introduction 


Hand gesture recognition consists of identifying the instant and the class associ- 
ated with a movement of the hand [1]. Hand gesture recognition has many appli- 
cations in the scientific and technological fields, for example: human computer 
interfaces (HCI), active prosthesis, and interaction with virtual environments [2]. 
A model that is suitable for these types of applications requires high recognition 
accuracy and usually has to respond in real time (i.e., in less than 300 ms) [3]. 
Additionally, some applications (e.g., HCI) require a recognition model to run 
on a computer with limited resources of RAM memory and processing. Hand 
gesture recognition models commonly use sensors like instrumented gloves, color 
cameras, depth cameras, and electromyographic sensors to acquire the input 
data for the model [4-6]. In this work, we use electromyographic (EMG) sensors 
because they are not affected by the variations of light, position and orientation 
of the hand. According to the scientific literature, the state-of-the-art recognition 
accuracy is about 85% for the models that use electromyographic sensors for 
hand gesture recognition [7]. For this reason, in this work our goal is to develop 
a model that achieves a recognition accuracy higher than 85% and responds in 
real time with limited resources of memory and processing. 

Machine learning is a framework that can be used to solve the problem of 
hand gesture recognition based on superficial electromyographic (sEMG) sig- 
nals. The most common classifiers for hand gesture recognition include: Support 
Vector Machines [8], Artificial Neural Networks [9,10], Deep Convolutional Neu- 
ral Networks [11], and k-Nearest Neighbors [12,13]. The conventional features 
used for hand gesture recognition are defined in the following domains: time 
(e.g., Mean Absolute Value and Zero Crossing), frequency (e.g., Mean Frequency 
and Frequency Histograms) and time-frequency (e.g., Wavelets). Models based 
on these classifiers and feature domains present high recognition accuracy and 
respond in real time. However, they also have some disadvantages, for instance: 
small number of predicted classes [7], too many repetitions for training the model 
[14], and demand for high computational resources [11]. Therefore, hand gesture 
recognition is still an open problem for new research. 

In this paper, we develop a hand gesture recognition model based on sEMG 
signals that responds in real time, achieves a recognition accuracy over the state- 
of-the-art, and works in a computer with limited resources of RAM memory and 
processing. The proposed model follows a sliding window approach using a main 
window and a sub-window. The model is composed of the following blocks: data 
acquisition, preprocessing, feature extraction, classification, and postprocessing. 
For data acquisition, we measure the sEMG signals using the Myo Armband. For 
preprocessing, we rectify, filter and detect the muscle activity in the main window 
observation. For feature extraction, we generate a feature vector by concatenat- 
ing the values of the preprocessed signal with the results of applying a bag of 
functions. For classification, we use a feedforward neural network to label every 
sub-window observation. Finally, for postprocessing we apply a simple majority 
voting, based on the labels from the sub-window classification, to label the main 
window observation with the corresponding gesture. The source code and the 
data used in this work are publicly available at the following link: 
https://drive.google.com/drive/folders/1rNgBFC38WXfruBocWmJnWNrR0iuA0HQw. 

Following this introduction, this paper is organized in three sections. In 
Sect. 2, we describe the materials and methods used in this work. In Sect. 3, 
we present the results obtained. Finally, in Sect. 4, we present the conclusions 
and outline future work. 


2 Materials and Methods 


2.1 Materials 


Myo Armband. In this work, we use Thalmic’s Myo Armband, illustrated 
in Fig. 1(a), because it provides an open software development kit, has low cost, 
can be expanded from 19 to 34 cm, and weighs only 93 g [15]. The Myo includes 
the following components: 8 superficial electromyographic sensors (Fig. 1(b)), a 
Bluetooth 4.0 module, and a 9-axis inertial measurement unit. The Myo streams data 
at 200 Hz and represents every measured value with 8 bits [16]. The Myo is 
also equipped with proprietary software (black box model) that recognizes five 
gestures: Fist, Wave In, Wave Out, Fingers Spread, and Double Tap (Fig. 1(c)). 


Fig. 1. (a) Myo Armband and (b) its channels. (c) Gestures detected by the Myo 
(panel labels: Fist, Wave Left, Wave Right, Fingers Spread, Double Tap). 


Dataset. In this paper, we use the data of 10 healthy volunteers used previously 
in [12,13] for training, validation and testing. We used this dataset to compare 
the proposed model with the previous models presented in [12,13]. This dataset 
contains a set for training and another set for testing. The training set consists of 
five repetitions of the five gestures indicated in Fig. 1(c), each recorded during two 
seconds. Additionally, the training set includes five sEMG measurements recorded 
during two seconds with the arm in the relaxed position. This set was used for 
training and validation. The testing set consists of 30 repetitions, each recorded 
during five seconds, of only the five gestures in Fig. 1(c). For every repetition, the 
volunteer started with the arm relaxed, then performed the gesture (around the 
middle of the recording), and then returned the arm to the relaxed position until 
the end of the recording. 


2.2 Methods 


Notation. In this paper, we denote matrices with bold uppercase letters 
(e.g., A). Vectors are denoted with bold lowercase letters (e.g., x). Constants 
are denoted with uppercase letters (e.g., N) and indices are denoted with italic 
lowercase letters (e.g., i). 


Data Acquisition. For this block, we apply a sliding window approach using 
a main window of length N. We represent the sEMG signals acquired with the 
Myo Armband and seen through the main window as a matrix A of size N × 8, 
where 8 is the number of sensors of the Myo Armband. The value $A_{i,j}$ represents 
the measurement at the instant of time i from the sensor j, where i = 1, 2, ..., N 
and j = 1, 2, ..., 8. Each element of the matrix A is in the range 
[-1, 1]. To generate the feature vectors for training the model, we use a main 
window $MW_{train}$ of length $N_{train} = 400$ for every repetition in the training set. To 
validate and test the model, we use a main window $MW_{test}$ of length $N_{test} = 200$ 
with a stride of 20 points between two consecutive windows. 
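
The positions of the main window over a test recording can be sketched as follows. The sampling rate of 200 Hz, the recording length of five seconds, the window length of 200 points and the stride of 20 points are taken from the text; the function name and the use of 0-based indices are choices made for this illustration.

```python
def main_window_starts(n_samples, window_length=200, stride=20):
    """Start indices (0-based) of every full main window in a recording."""
    return list(range(0, n_samples - window_length + 1, stride))

SAMPLING_RATE_HZ = 200
n_samples = 5 * SAMPLING_RATE_HZ           # a five-second test repetition
starts = main_window_starts(n_samples)

window_seconds = 200 / SAMPLING_RATE_HZ    # duration covered by one main window
stride_seconds = 20 / SAMPLING_RATE_HZ     # time between two consecutive windows
```

With these settings, each main window covers one second of signal and a new window observation becomes available every 0.1 s.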


Preprocessing. The sEMG signals can be modeled by a non-stationary stochastic 
process [13]. This means that the probability distribution of the sEMG 
changes with time. However, we can reduce the non-stationarity of the sEMG 
by smoothing out its values. The idea of this process is to reduce the changes 
of the probability distribution of the sEMG over time, assuming that the 
smoothed sEMG is locally stationary [17]. In this work, for smoothing out the 
sEMG signals we apply rectification and filtering. The preprocessing starts with 
the signal rectification using the absolute value function. Then, a Butterworth 
low-pass filter $\psi$ of fourth order and cutoff frequency of 5 Hz is applied to A. 
Additionally, we apply a muscle activity detection function $\Phi$ to the main 
window observation, which is described in [13]. The function $\Phi$ returns the initial 
and final indices that contain the muscle activity within $MW_{train}$. This function 
is used to remove the head and tail that refer to the relaxed position of the hand 
for every repetition in the training set. In addition, we apply a muscle activity 
verification function $\Omega$ to the main window observation in the testing set. The 
function $\Omega$ is described in Eq. (1), where C is the observation of the rectified 
signal and $T_{preprocessing}$ is a threshold. If $\Omega(C)$ is true, then the recognition 
process continues; otherwise, the response is No Gesture for the main window. 

$$\Omega(C) = \sum_{i=1}^{N} \sum_{j=1}^{8} C_{i,j} > T_{preprocessing} \qquad (1)$$

We apply $\Phi$ only to the training set because $\Phi$ returns the boundaries of the 
muscle activity. In contrast, $\Omega$ only verifies whether or not there is activity within the 
main window observation. Additionally, $\Phi$ increases the time of preprocessing 
compared to $\Omega$. We tested different thresholds and $T_{preprocessing} = 0.39$ gave us 
the best results in the validation set. 
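
The muscle activity verification of Eq. (1) can be sketched as follows. The threshold of 0.39 is the value reported in the text; the function names and the two toy windows are hypothetical, and a window is represented as a list of rows with one value per sensor.

```python
def rectify(A):
    """Rectification: absolute value of every sample of the window."""
    return [[abs(value) for value in row] for row in A]

def muscle_activity_detected(C, threshold=0.39):
    """True when the sum of the rectified window exceeds the threshold."""
    return sum(sum(row) for row in C) > threshold

# Toy windows with 10 instants and 8 sensors (real main windows are longer).
relaxed = [[0.001, -0.001] * 4 for _ in range(10)]
active = [[0.05, -0.05] * 4 for _ in range(10)]

relaxed_has_activity = muscle_activity_detected(rectify(relaxed))
active_has_activity = muscle_activity_detected(rectify(active))
```

When the check fails, the main window can be labeled No Gesture immediately, so feature extraction and classification are skipped for that window.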


Feature Extraction. For this block, we use a sub-window SW to observe a 
segment of the signal seen through the main window (Fig. 2(a)). The segment of 
the signal seen through the sub-window SW is represented by a matrix E of 
size M × 8; meanwhile, the signals observed through the main window MW are 
represented as a matrix A of size N × 8, where N > M. We use a stride of one 
point between two consecutive sub-windows (Fig. 2(b)). 

The features for our classifier come from two different sources: the values of 
the preprocessed signals and the results of applying a bag of functions to the 
raw signals. We only use functions from the time domain because using functions 
from the frequency and the time-frequency domains increases the computational 
cost of this block. We apply the following steps to extract feature vectors, where 
the index i represents the ith instant of the sEMG signal seen through MW: 

1. Align the first point of the sub-window SW with the point i = 1 of the sEMG 
   signal seen through the main window MW. 
2. Preprocess the sub-window observation E to get $F = \psi(\mathrm{abs}(E))$. Convert the 
   matrix F into a feature vector $v_i$ by concatenating its rows. 
3. Apply a bag of functions to the raw values of E to get the feature vector $z_i$. 
4. Concatenate $v_i$ with $z_i$ horizontally to get the vector $x_i$. 
5. Move the first point of the sub-window SW to the instant i := i + 1 and 
   repeat the steps from (2) to (5) until i = N - M + 1. 

Fig. 2. (a) Signals seen through both the main and the sub windows. (b) Movement 
of the sub-windows over the main window for feature extraction, classification, and 
postprocessing. (c) Process to generate a feature vector from a sub-window observation. 


The process for feature extraction is illustrated in Fig. 2(c). Every $x_i$ is of 
length $|v_i| + |z_i|$ (where |x| denotes the length of vector x) and is associated 
with a label $y_i$ that corresponds to the gesture of the repetition from which $x_i$ 
comes. Empirically, we found that a sub-window length of M = 75 gave 
us the highest recognition accuracy in the validation set. The length of v is 
equal to M × 8, so there are |v| = 75 × 8 = 600 features. The bag of functions is 
composed of the Mean Absolute Value, Slope Sign Changes, Waveform Length, 
Root Mean Square, and the Hjorth parameters [18]. The application of these 
functions creates a vector z of 56 features. The final length of the feature vector 
x is equal to |x| = 600 + 56 = 656 features. The number of training vectors 
obtained from the sub-window observation along the main window is N - M + 1 
per gesture repetition. Therefore, the total number of vectors is (N - M + 1) × 
NumberOfGestures × RepetitionsPerGesture = (N - M + 1) × 5 × 5. The number 
of vectors for training is different per user (between 2995 and 5606) because the 
length of the muscle activity varies from one repetition to the others. 
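
The construction of one feature vector can be sketched as follows. Several points are assumptions made for this illustration and are not taken from the text: the 56 bag-of-functions features are read as seven values per sensor (Mean Absolute Value, Slope Sign Changes, Waveform Length, Root Mean Square and the three Hjorth parameters activity, mobility and complexity), the Slope Sign Changes are counted without an amplitude threshold, and the low-pass filter of the preprocessing step is omitted so that only rectification is applied. All function names are hypothetical.

```python
import math

def variance(x):
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def diff(x):
    """First difference of a sequence."""
    return [b - a for a, b in zip(x, x[1:])]

def hjorth(x):
    """Hjorth activity, mobility and complexity of one channel."""
    dx = diff(x)
    ddx = diff(dx)
    activity, var_dx, var_ddx = variance(x), variance(dx), variance(ddx)
    mobility = math.sqrt(var_dx / activity) if activity > 0 else 0.0
    mobility_dx = math.sqrt(var_ddx / var_dx) if var_dx > 0 else 0.0
    complexity = mobility_dx / mobility if mobility > 0 else 0.0
    return [activity, mobility, complexity]

def bag_of_functions(x):
    """Seven time-domain features of one channel."""
    mav = sum(abs(v) for v in x) / len(x)
    ssc = sum(1 for a, b, c in zip(x, x[1:], x[2:]) if (b - a) * (b - c) > 0)
    wl = sum(abs(d) for d in diff(x))
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [mav, ssc, wl, rms] + hjorth(x)

def feature_vector(E):
    """Feature vector of a sub-window E given as M rows of 8 sensor values."""
    # v: rectified samples concatenated row by row (filtering omitted in this sketch)
    v = [abs(value) for row in E for value in row]
    # z: bag of functions applied to the raw values of every channel
    z = [f for channel in zip(*E) for f in bag_of_functions(list(channel))]
    return v + z

M = 75
E = [[math.sin(0.1 * i * (j + 1)) for j in range(8)] for i in range(M)]
x = feature_vector(E)
n_sub_windows = 400 - M + 1  # sub-window positions in a main window of length 400
```

The length of x in this sketch matches the 600 + 56 = 656 features stated in the text, which is the only property the example is meant to reproduce.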

We used the t-Distributed Stochastic Neighbor Embedding (t-SNE) to 
visualize how the training feature vectors from each user and from each class (gesture) 
are clustered in the feature space. The results from the t-SNE applied to a 
single user are displayed in Fig. 3. We can note that when the length of the 
sub-window increases, the projected feature vectors of each class get closer to each 
other. However, if the length of the sub-window increases, then the number of 
feature vectors from a repetition is reduced and the length of the feature vector 
increases. This makes the recognition model prone to overfitting. 


Fig. 3. t-SNE results from different sub-window lengths (panel titles: Sub-Window 
length N = 10, Sub-Window length N = 50). 


Classification. In this work, we use artificial neural networks (ANN) for
classification because this family of functions is a universal approximator [19]: a
feedforward neural network with only three layers (input, hidden, and output),
sigmoid transfer functions, and an appropriate number of nodes in the hidden
layer is able to approximate any function. For our model, we implemented
an ANN with three layers and trained this network using full-batch gradient
descent with a cross-entropy cost function for 75 epochs. The input layer of
the network has 656 nodes, which corresponds to the length of the feature
vectors. After experimenting with different numbers of nodes in the hidden layer,
we obtained the best recognition results in the validation set using 328 nodes,
which is half the number of nodes in the input layer. The output layer has 6 nodes,
which corresponds to the number of predicted gestures. We tested the following
transfer functions for the hidden layer: logsig, relu, softplus, elu, and tanh. We
obtained the best results in the validation set using the tanh transfer function.
For training the network, we applied regularization using weight decay with a
factor $\lambda = 750/(N - M + 1) \times 5 \times 5$. Additionally, we applied feature scaling
using $x_i' = (x_i - \bar{x}) ./ \sigma$, where $\bar{x}$ is the vector of per-feature mean values,
$\sigma$ is the vector of per-feature standard deviations of the vectors $x_i$, and
./ denotes the element-wise division between two vectors.
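
The feature scaling step can be sketched as below. This is a minimal sketch under the assumption that the mean and standard deviation are estimated on the training vectors and then reused for new data; the function names and the zero-variance guard are illustrative additions, not part of the paper.

```python
import numpy as np

def fit_scaler(X_train):
    """Per-feature mean and standard deviation of the training vectors."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    # Assumption: constant features are left unscaled to avoid dividing by 0.
    std = np.where(std == 0, 1.0, std)
    return mean, std

def scale(X, mean, std):
    """Element-wise standardization: x' = (x - mean) ./ std."""
    return (X - mean) / std
```

Applying `scale` with statistics fitted on the training set gives features with zero mean and unit variance on that set.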

Postprocessing. For each observation of the sEMG using the main window, we
obtain a vector of labels, where each label corresponds to the feature vector of a
sub-window observation. We define a threshold $T_{postprocessing}$ and apply a simple
majority vote to assign a label to the main window observation: we assign the
label whose share of occurrences in the vector of labels of the main window
exceeds $T_{postprocessing}$. Otherwise, we assign the label No Gesture. After testing
different thresholds, we found that $T_{postprocessing} = 70\%$ gave us the highest
recognition accuracy in the validation set.
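
The thresholded majority vote can be illustrated with a short sketch. The function name and the handling of an empty label vector are assumptions made for this example.

```python
from collections import Counter

def vote(labels, threshold=0.70, no_gesture="No Gesture"):
    """Return the most frequent label if its share exceeds the threshold."""
    if not labels:
        return no_gesture  # assumption: nothing to vote on
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > threshold else no_gesture
```

For example, eight "Fist" labels out of ten pass a 70% threshold, whereas exactly seven out of ten do not, because the share must be strictly greater than the threshold.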


3 Results and Discussion 


3.1 Evaluation Method 


In addition to evaluating the proposed model, we also evaluated a model that is
based only on the preprocessed signal values (rectification and low-pass filtering)
and another model that is based only on the results from the bag of functions.
Recall that the proposed model combines these two types of features.

To evaluate the recognition accuracy, we trained a model for each volunteer
using his/her own training set. Then, we used the model to predict the label
of every repetition of the testing set using a window of length $N_{test} = 200$
with a stride of 20 points. The application of our method returns a vector with
$(1000 - 200)/20 = 40$ labels for each repetition of the testing set, since
the length of every repetition of the testing set is around 1000 points.
A recognition was considered successful when all the labels different from the
class No Gesture matched the actual class of the repetition. Otherwise, the
recognition was considered wrong and the label returned for the repetition
was the first label of the vector different from No Gesture [12]. To measure the
response time of the tested models, we used a desktop computer with an Intel
Core i7-3770S processor and 4 GB of RAM. The average time reported in this
paper is the mean of the times needed to classify each window observation in the
testing set.
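
The success rule for one test repetition can be sketched as follows. This is an illustrative reading of the rule, with one labeled assumption: a repetition whose labels are all No Gesture is treated as wrong, a case the text does not spell out. The function name is hypothetical.

```python
NO_GESTURE = "No Gesture"

def evaluate_repetition(labels, true_label):
    """Return (is_correct, returned_label) for one test repetition."""
    gestures = [lab for lab in labels if lab != NO_GESTURE]
    if not gestures:
        # Assumption: no detected gesture at all counts as a wrong recognition.
        return False, NO_GESTURE
    if all(lab == true_label for lab in gestures):
        return True, true_label
    # Wrong recognition: report the first label different from No Gesture.
    return False, gestures[0]
```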


3.2 Results 


The confusion matrix for the proposed model is illustrated in Fig. 4. This con- 
fusion matrix shows an overall recognition accuracy of 90.7%. The gesture Fist 
was the one with the highest sensitivity (98.3%) and Double tap was the one 
with the lowest (85.3%). Regarding precision, the gesture Wave Out had the 
highest result (99.6%) and the gesture Fist the lowest (86.8%). Therefore, the 
best predictions of the proposed model are for the gesture Wave Out. On the 
other hand, the proposed model is more likely to predict the gesture Fist incor- 
rectly. Additionally, some repetitions are predicted as No Gesture because they 
did not pass the thresholds for preprocessing or postprocessing. 

Fig. 4. Confusion matrix for the proposed model.

Table 1 shows that the proposed model, which uses both types of features (the
preprocessed signal values and the results from the bag of functions), has the
best accuracy compared to the other models. The model that uses only the
preprocessed signal values has the shortest response time (2.59 ms), and its
recognition accuracy is higher than that of the model that uses only the results
from the bag of functions. However, the model that uses only the bag of functions
has the lowest training time because its architecture is less complex. Table 1
also shows that the proposed model responds in 29.38 ms, which is much lower
than the real-time limit (300 ms).


Table 1. Summary and comparative table.

Model                                                           | Accuracy (%) | Response (ms) | Training (s)
Evaluated models:
- Model using both approaches                                   | 90.7         | 29.38         | 34.78
- Model using only the preprocessed signal values               | 88.3         | 2.59          | 29.71
- Model using only the results from the bag of functions        | 86.1         | 26.52         | 2.08
Other models:
- Private Myo Armband model [12,13]                             | 83.1         | -             | -
- Model using k-NN and DTW [12]                                 | 86.0         | 245.50        | -
- Model using k-NN and DTW with muscle activity detection [13]  | 89.5         | 193.10        | -

The results from Table 1 show that the proposed model is faster than the
models that use the Dynamic Time Warping (DTW) algorithm with a k-Nearest
Neighbor (k-NN) classifier because the feature extraction and the classification
performed by the ANN are less computationally expensive. The proposed model
also outperforms the other models in terms of accuracy.


4 Conclusions 


In this paper, we have presented a hand gesture recognition model based on
sEMG signals. The model is trained for each user and requires 5 repetitions of
each class to be recognized. The model responds in 29.38 ms, which is lower than
the limit defined for real time (300 ms), using a computer with limited RAM and
processing resources. In addition, the model showed a recognition
accuracy of 90.7%, which is higher than the state of the art (85%).

For this model, we applied a sliding window approach using a main win- 
dow and a sub-window. The sub-window allowed us to observe a segment of the 
signal seen through the main window. The model is composed of five blocks: 
data acquisition, preprocessing, feature extraction, classification, and postpro- 
cessing. For data acquisition, we used the Myo Armband to acquire the sEMG 
signals. For preprocessing, we rectified, filtered and detected the muscle activ- 
ity in the main window observation. For feature extraction, we used two sets of 
features: the preprocessed signal values and the results from a bag of functions. 
For classification, we used an ANN of three layers to classify every sub-window 
observation. Finally, for postprocessing we applied a simple majority voting on 
the results of the ANN to decide the final gesture within the main window. 

We found that the recognition accuracy of the proposed model improves when 
we combine the values of the preprocessed signal with the results of applying a 
bag of functions. Future work includes defining a generalized model for all the 
users with high accuracy, that works in real time, and uses limited computational 
resources of RAM and processing. 


Abstract. The alternating direction method of multipliers (ADMM) is an
algorithm for solving large-scale data optimization problems in machine
learning. In order to reduce the communication delay in a distributed
environment, asynchronous distributed ADMM (AD-ADMM) was proposed. However,
due to the unbalanced process arrival pattern in multiprocessor clusters,
the star-structured communication used in AD-ADMM is inefficient.
Moreover, the load across the cluster is unbalanced, which decreases
the data processing capacity. This paper proposes a hierarchical
parameter server communication structure (HPS) and an asynchronous
distributed ADMM based on it (HAD-ADMM). The algorithm mitigates the unbalanced
arrival problem through process grouping and scattered updating of the global
variable, which basically achieves load balancing. Experiments show that
HAD-ADMM is highly efficient in a large-scale distributed environment and has no
significant impact on convergence.


Keywords: Consensus optimization · ADMM · Asynchronous ·
Hierarchical communication structure


1 Introduction 


With the rapid growth of Internet data, the performance and efficiency of a single
computer cannot meet current computing needs. Therefore, solving machine
learning problems on clusters is increasingly important.

The alternating direction method of multipliers (ADMM) decomposes the original 
problem into sub-problems for parallel iterations. It can solve a variety of machine 
learning problems, such as SVM [1] and the optimization of neural networks [2]. 
The ADMM was first proposed by [3] and [4]. Then, [5] proved that the ADMM is 
suitable for distributed optimization problems. [6] have applied the ADMM to the 
global consensus optimization problem. [7] solves the decentralized consensus opti- 
mization problem by ADMM. 

However, in the global consensus problem, the ADMM needs to synchronize
variables at each iteration, so network delay becomes the bottleneck of algorithm
efficiency. [8] proposed an asynchronous ADMM algorithm (AD-ADMM) for the
global consensus optimization problem. [9] and [10] added a penalty term based on [8]
to improve the convergence efficiency on non-convex problems. However, the
AD-ADMM was implemented in a master-slave model, whose communication efficiency is
low in multiprocessor clusters.

On the one hand, in a distributed environment such as MPI, intra-node and
inter-node communication differ greatly. This is called the unbalanced arrival
problem [11, 12]. To address this issue, [13] proposes an RDMA-based process
arrival model to optimize aggregate communication, [14] uses remote shared memory
to improve the communication speed, and [15] overlaps inter-node communication
with intra-node communication through a pipelined method. On the other hand, all
slaves need to communicate with the master. The large load on the master can be
reduced through the parameter server. The concept of the parameter server derives
from [16], which uses distributed Memcached for parameter storage. There are
already many parameter server frameworks, such as Petuum [17] and ps-lite [18],
which divide the nodes into several masters and workers. The workers update
local parameters, and the masters update global variables.

In this paper, in order to increase communication efficiency and achieve load
balancing, a hierarchical parameter server structure (HPS) is designed. In addition,
an asynchronous ADMM based on HPS and AD-ADMM (HAD-ADMM) is proposed.
A number of simulation experiments verify that HAD-ADMM has no significant
impact on convergence and performs well in a large multiprocessor distributed
environment.


2 Distributed ADMM 


In general, many distributed machine learning problems can be expressed as the
following global consensus optimization problem:


$$\min f(x) = \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t. } x_i - z = 0,\ i = 1, \dots, N \qquad (1)$$


where $x_i \in \mathbb{R}^n$, $f_i : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, and $z$ is the consensus variable. The local
variables $x_i$ should be equal to each other. Problem (1) divides the objective
function $f(x)$ into $N$ parts, so this problem can be solved with $N$ processes.
Solving (1) through the ADMM gives:


$$x_i^{k+1} = \arg\min_{x_i} \Big( f_i(x_i) + (y_i^k)^T (x_i - z^k) + \frac{\rho}{2} \|x_i - z^k\|_2^2 \Big) \qquad (2a)$$
$$z^{k+1} = \arg\min_{z} \sum_{i=1}^{N} \Big( (y_i^k)^T (x_i^{k+1} - z) + \frac{\rho}{2} \|x_i^{k+1} - z\|_2^2 \Big) \qquad (2b)$$
$$y_i^{k+1} = y_i^k + \rho \big( x_i^{k+1} - z^{k+1} \big) \qquad (2c)$$


where $y_i$ are the Lagrange multipliers and $\rho > 0$ is the penalty parameter.

According to (2a), (2b), and (2c), x and y can be updated independently across the
N processes, while z needs to aggregate all the local variables in the cluster, so
the network delay is high. Therefore, the AD-ADMM [8] was proposed to reduce the
time overhead by means of a partial barrier and a bounded delay.
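
To make the synchronous iteration concrete, the following minimal sketch runs consensus ADMM on a toy problem. It is not the authors' code: it assumes quadratic local objectives $f_i(x) = \frac{1}{2}\|x - a_i\|_2^2$ (chosen only because the x-update then has a closed form) and uses the standard consensus updates, in which z is the average of $x_i + y_i/\rho$. The function name is hypothetical.

```python
import numpy as np

def consensus_admm(a, rho=1.0, iters=100):
    """Synchronous consensus ADMM for f_i(x) = 0.5 * ||x - a_i||^2.

    a: (N, n) array, one row per process. Returns (x, y, z).
    """
    N, n = a.shape
    x = np.zeros((N, n))
    y = np.zeros((N, n))
    z = np.zeros(n)
    for _ in range(iters):
        # x-update: closed-form minimizer of the augmented Lagrangian in x_i.
        x = (a - y + rho * z) / (1.0 + rho)
        # z-update: average of x_i + y_i / rho over all processes.
        z = np.mean(x + y / rho, axis=0)
        # y-update: dual ascent step on the consensus constraint.
        y = y + rho * (x - z)
    return x, y, z
```

For this toy objective the consensus solution is the mean of the rows of `a`. The x- and y-updates are row-wise independent, whereas the z-update needs every row, which is the synchronization point discussed above.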


2.1 Asynchronous Distributed ADMM


The AD-ADMM divides processes into one master and N workers. The master does
not have to wait for all workers, but receives parameters from $A$ workers, $0 < A < N$,
i.e., a partial barrier. In order to guarantee convergence, the AD-ADMM constrains
the staleness to a certain range, i.e., a bounded delay. The AD-ADMM sets a clock $k$
for each process, and the clock increases after each iteration. The master should
wait for workers whose clock is greater than $\tau > 0$. The AD-ADMM is given in Table 1.


Table 1. Asynchronous distributed ADMM (AD-ADMM). 


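Algorithm 1: AD-ADMM
Master:
 1: initialize: $z$, $k = 0$, $d_1 = d_2 = \dots = d_N = 0$.
 2: broadcast $z$ to all Workers.
 3: repeat
 4:   wait until receiving $\{\hat{x}_i, \hat{y}_i\}$ from Workers $i$, $i \in A_k$, such that
      $|A_k| \ge A$ and $\forall i \in A_k^c$, $d_i < \tau$.
 5:   update
        $x_i^{k+1} = \hat{x}_i$ if $i \in A_k$, and $x_i^{k+1} = x_i^k$ if $i \in A_k^c$;
        $y_i^{k+1} = \hat{y}_i$ if $i \in A_k$, and $y_i^{k+1} = y_i^k$ if $i \in A_k^c$;
        $d_i = 1$ if $i \in A_k$, and $d_i = d_i + 1$ if $i \in A_k^c$;
        $z^{k+1} = \arg\min_z \sum_{i=1}^{N} \Big( (y_i^{k+1})^T (x_i^{k+1} - z) + \frac{\rho}{2} \|x_i^{k+1} - z\|_2^2 \Big)$.
 6:   broadcast $z^{k+1}$ to the Workers $i$, $i \in A_k$.
 7:   set $k \leftarrow k + 1$.
 8: until the stopping criterion is satisfied.
 9: output $z^k$.

the ith Worker:
 1: initialize: $x_i$, $y_i$, $k_i = 0$.
 2: repeat
 3:   wait until receiving $\hat{z}$ from Master.
 4:   update
        $x_i^{k_i+1} = \arg\min_{x_i} \Big( f_i(x_i) + (y_i^{k_i})^T (x_i - \hat{z}) + \frac{\rho}{2} \|x_i - \hat{z}\|_2^2 \Big)$,
        $y_i^{k_i+1} = y_i^{k_i} + \rho \big( x_i^{k_i+1} - \hat{z} \big)$.
 5:   send $\{x_i^{k_i+1}, y_i^{k_i+1}\}$ to Master.
 6:   set $k_i \leftarrow k_i + 1$.
 7: until the stopping criterion is satisfied.

where $\{d_1, d_2, \dots, d_N\}$ records the clock of the N workers' last arrival, $A_k$ is
the set of workers that have arrived at clock $k$, and $A_k^c$ is the complement of $A_k$.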
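
The partial barrier and bounded delay conditions can be illustrated with a small sketch of the master's bookkeeping. This is only an illustration under stated assumptions: the function names are hypothetical, and the value to which an arrived worker's counter is reset (`reset_value`) is an assumption that should be checked against [8].

```python
def master_can_update(arrived, delays, min_arrivals, tau):
    """Check the partial barrier and the bounded delay conditions.

    arrived: set of worker ids whose updates have been received.
    delays:  dict mapping every worker id to its counter d_i.
    """
    enough = len(arrived) >= min_arrivals                  # partial barrier
    fresh = all(d < tau for i, d in delays.items()
                if i not in arrived)                       # bounded delay
    return enough and fresh

def update_delays(arrived, delays, reset_value=1):
    """Reset counters of arrived workers and age the counters of the others."""
    return {i: (reset_value if i in arrived else d + 1)
            for i, d in delays.items()}
```

When a slow worker's counter reaches the bound, `master_can_update` returns False until that worker's update is included in the arrival set, which is how staleness stays bounded.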


2.2 Star Communication Topology 


The AD-ADMM is based on the master-slave model, which adopts a star structure. 
This section analyzes the problems in the star structure. We start with some definitions: 


Definition 1. In a cluster with $N_n$ nodes, each node has $M_i > 0$ workers,
$i \in \{1, 2, \dots, N_n\}$. There is only one master in the entire cluster, located on the
$l$th node. There are $N$ workers in the cluster, i.e., $\sum_{i=1}^{N_n} M_i = N$.


Because of the process arrival pattern in MPI, the $M_l$ workers on the $l$th node must
wait for other workers. In addition, the master must communicate with the N workers,
causing network congestion. Finally, the master needs to store the parameters of N
workers, which is a big challenge. To address these issues, this paper proposes the
HPS structure.


3 Asynchronous Distributed ADMM Based on Hierarchical 
Parameter Server 


3.1 Hierarchical Parameter Server 


This paper expands the parameter server into HPS through process grouping. Similar to
[15], intra-node and inter-node communication are distinguished.


Process Grouping. HPS associates processes with nodes and sets up one master in
each node, called a submaster. In addition, a master is set up to communicate with
each submaster. The workers only communicate with their own submaster. A submaster
only communicates with the master and its workers. Therefore, except for the
communication between the submasters and the master, all communication is intra-node.
When the node size is large, HPS can effectively reduce the number of inter-node
communications.


Update Strategy. Every submaster stores the variables from the workers on the same
node, and uses these parameters to update z. The master stores and aggregates the
parameters from the submasters. This strategy greatly reduces the load on the master.
The HPS topology and the star topology are shown in Fig. 1:



Star topology. HPS topology. 


Fig. 1. The star and HPS topology. 


3.2 Asynchronous Distributed ADMM Based on HPS 


HAD-ADMM is similar to AD-ADMM, but provides a new update strategy based on
HPS. The clock of a submaster is equal to that of the workers belonging to it.


Updating $x_{ij}$ and $y_{ij}$ by Worker. HAD-ADMM updates y first, then updates x,
and finally transfers the variables to the submaster. Otherwise, the dual variable y
sent by the worker is the result of the kth iteration. Compared to AD-ADMM, the
worker procedure only changes the update order. The subscripts of $x_{ij}$ and $y_{ij}$
denote the jth Worker of the ith SubMaster.

Aggregation of $x_{ij}$ and $y_{ij}$ by SubMaster. From (2b), we have that


$$z^{k+1} = \frac{1}{N} \sum_{i=1}^{N} \Big( x_i^{k+1} + \frac{1}{\rho} y_i^k \Big) \qquad (3)$$


Equation (3) is separable. Each submaster therefore computes its own part of the
global variable z, and then sends $z_i^{k_i+1}$ to the master. The procedure of the
submaster is shown in Table 2.


Table 2. Asynchronous distributed ADMM based on HPS (HAD-ADMM) — submaster 


Algorithm 3 HAD-ADMM - the ith SubMaster:

 1: initialize: $x_i$, $y_i$, $P_i$, $k_i = 0$.
 2: repeat
 3:   wait for $\hat{z}$ from Master.
 4:   broadcast $\hat{z}$ to the Workers $j$, $j \in P_i$.
 5:   wait for $x_{ij}^{k_i+1}$ and $y_{ij}^{k_i}$ from all Workers in $P_i$.
 6:   compute:
        $z_i^{k_i+1} = \frac{1}{N} \sum_{j \in P_i} \Big( x_{ij}^{k_i+1} + \frac{1}{\rho} y_{ij}^{k_i} \Big)$
 7:   send $z_i^{k_i+1}$ to Master.
 8:   set $k_i \leftarrow k_i + 1$.
 9: until the stopping criterion is satisfied.


where $P_i$ is the set of workers on the ith node.
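
The separability used by the submasters can be checked numerically. The sketch below is illustrative only and assumes the averaging form of the z-update, $z = \frac{1}{N} \sum_i (x_i + y_i / \rho)$, with each submaster computing the part of the sum that belongs to its own workers; the function names are hypothetical.

```python
import numpy as np

def submaster_partial(x, y, rho, n_total):
    """Partial contribution of one node to z.

    x, y: (M_i, n) arrays holding the variables of the node's workers.
    n_total: total number of workers N in the cluster.
    """
    return np.sum(x + y / rho, axis=0) / n_total

def master_aggregate(partials):
    """The master only sums the per-node contributions."""
    return np.sum(partials, axis=0)
```

Summing the per-node contributions reproduces the global average, so the master handles one vector per node instead of two vectors per worker.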


Update z by Master. The master receives and aggregates $z_i^{k_i+1}$, and finally sends z
to each submaster. The procedure of the master is shown in Table 3:


Table 3. Asynchronous distributed ADMM based on HPS (HAD-ADMM) — master 


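Algorithm 4 HAD-ADMM - master:

 1: initialize: $z$, $k = 0$, $d_1 = d_2 = \dots = d_{N_n} = 0$.
 2: broadcast $z$ to all SubMasters.
 3: repeat
 4:   wait until receiving $\hat{z}_i$ from SubMasters $i$, $i \in A'_k$, such that
      $|A'_k| \ge A_n$ and $\forall i \in A'^c_k$, $d_i < \tau$.
 5:   update
        $z_i^{k+1} = \hat{z}_i$ if $i \in A'_k$, and $z_i^{k+1} = z_i^k$ if $i \in A'^c_k$;
        $d_i = 1$ if $i \in A'_k$, and $d_i = d_i + 1$ if $i \in A'^c_k$;
        $z^{k+1} = \sum_{i=1}^{N_n} z_i^{k+1}$.
 6:   broadcast $z^{k+1}$ to the SubMasters $i$, $i \in A'_k$.
 7:   set $k \leftarrow k + 1$.
 8: until the stopping criterion is satisfied.
 9: output $z^k$.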

where A; is the set of submaster arrived when the clock is k, Ay < EM <N,. 

The procedure of HAD-ADMM is shown in Fig. 2; processes shown in the same color
in the figure are on the same node.


Fig. 2. Asynchronous distributed ADMM based on HPS (HAD-ADMM) (Color figure online) 


4 Convergence and Performance Analysis 


4.1 Convergence Analysis 
First, a definition of the relevant variables is given in the following paragraph. 


Definition 2. Assume that the clock $k$ has run for $T$ iterations, and $T_i$ is the
number of iterations when the clock of the ith worker is $k_i$. $\hat{z}_i^{k_i}$ is the $\hat{z}$
received by the ith worker at its $k_i$th iteration. $\bar{x}_i$ is the average of $x_i$
throughout its $T_i$ iterations. Similarly, $\bar{z}$ is the average of $z$ through $T$
iterations.


[8] proves that Theorem 1 holds under Assumption 1.


Assumption 1 [8]. At any master iteration k, updates of the N workers have the same 
probability of arriving at the master. 


Theorem 1 [8]. Let (x*, z*) be the optimal solution of problem (1), and y; is the 
optimal dual variable in ith worker. Then 
} 


(4) 


\ No = E Nt N wy2 1 N 
[Duo 10) ona] a alld -el 


where $z^0$ and $y_i^0$ are the initial values of $z$ and $y_i$.


In other words, the convergence rate of AD-ADMM is 0). HAD-ADMM only 


changes A into >> Mj, i € Ay. Therefore, under Assumption 2, the convergence rate 
of HAD-ADMM is basically the same as AD-ADMM. 


ja: 


Assumption 2. At any master iteration, when A = Sr, Mi,i € Al, >, Mi > |Arl. 


4.2 Performance Analysis


In order to simplify the analysis, assume that the master in the cluster is on the first 
node. 


Star Topology. The time required for one iteration, $T_{star}$, is:


$$T_{star} = t_{calc_s} + 2 \sum_{i=1}^{|A_k|} t_{wm_i} \qquad (5)$$


where $t_{calc_s}$ is the compute time and $t_{wm_i}$ is the communication time between the
master and the ith worker, $i = 1, 2, \dots, |A_k|$. If the master and the worker are on the
same node, let the communication time be $t_{intra}$; otherwise, let it be $t_{inter}$.
Therefore,


$$T_{star} \approx t_{calc_s} + 2 \big[ M_1 t_{intra} + (|A_k| - M_1) t_{inter} \big] \qquad (6)$$


HPS Topology. The time Typs required by one iteration is: 


Typs = = tealch +2 (55 | tsm; + ye | S tun) (7) 


where $t_{ws_i}$ is the communication time between a worker and the ith submaster, and
$t_{sm_i}$ is the communication time between the ith submaster and the master. Similarly,
if the master and the submaster are on the same node, let the communication time be
$t'_{intra}$; otherwise, let it be $t'_{inter}$. In addition, let the communication time
between the submaster and the worker be $t''_{intra}$. Thus


Typs ~ Lealch + 2 (Ja; = Dx, boier nt +56 | > a bua (8) 


where i € A}. In the same cluster, teates © tealels Uh ing I tintra, Make Tsar — Tups, SO 
AT = 2(|A;,| z 1) (Mtiner E Ed) + 2|(M — MIA, |) tintra z en (9) 


Since the submaster in HAD-ADMM sends only one variable to the master, whereas
workers in AD-ADMM need to send two variables to the master, $t'_{inter} < t_{inter}$.
Similarly, $t'_{intra} < t_{intra}$. Let $\Delta T > 0$; then


1 
linter — lintra > M(|A\| = 1) (ae Entra) = = u(t, inter + tnra) (10) 


If EM =1, AT<0. When EM > 1, u<1. Therefore, when u is small enough, 
AT > 0. 


5 Experiment


In order to test the convergence and performance of HAD-ADMM, a simulation
experiment was carried out.

The data set is $T = \{(\alpha_1, \beta_1), (\alpha_2, \beta_2), \dots, (\alpha_S, \beta_S)\}$, where $\alpha_j \in \mathbb{R}^s$ is a
feature vector and $\beta_j \in \{0, 1\}$ is its label. T is evenly distributed over the N
nodes. The global consensus optimization problem for logistic regression (LR) is
therefore:


$$\min \sum_{i=1}^{N} L(w_i) = \sum_{i=1}^{N} \sum_{j} \big[ -\beta_j (\alpha_j \cdot w_i) + \log\big(1 + \exp(w_i \cdot \alpha_j)\big) \big] \qquad (11a)$$
$$\text{s.t. } w_i - z = 0,\ i = 1, \dots, N \qquad (11b)$$

In this paper, two clusters are used. One cluster has 8 compute nodes connected by
Fast Ethernet. The other has 16 nodes connected by Gigabit Ethernet. Each node has
4 cores and 8 GB of memory. The data set is sparse, with dimension s =
10000000 and size S = 43264. The algorithm is implemented in C++ with MPICH
v3.2. For each worker, L-BFGS is chosen to solve (2a). The penalty parameter is $\rho = 1$.
The stopping criterion is that the residuals $r^k$ and $s^k$ [5] satisfy:


I], < 10/54 10~*max{ Iwll, lla} 


Al], < 1025+10], (12) 
where 


$$\|r^k\|_2^2 = \sum_{i=1}^{N} \|w_i^k - z^k\|_2^2, \qquad \|s^k\|_2^2 = N \rho^2 \|z^k - z^{k-1}\|_2^2 \qquad (13)$$
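
A minimal sketch of a residual-based stopping test is given below. It assumes the standard consensus residuals from [5] (the primal residual measures the disagreement between the local variables and z, the dual residual measures the change in z); the tolerances `eps_pri` and `eps_dual` are hypothetical parameters, since the exact thresholds are not reproduced here.

```python
import numpy as np

def residual_norms(w, z, z_prev, rho):
    """Primal and dual residual norms for consensus ADMM.

    w: (N, n) local variables, z and z_prev: (n,) consensus variables.
    """
    r_norm = np.sqrt(np.sum((w - z) ** 2))                       # ||r^k||_2
    s_norm = np.sqrt(w.shape[0]) * rho * np.linalg.norm(z - z_prev)  # ||s^k||_2
    return r_norm, s_norm

def should_stop(r_norm, s_norm, eps_pri, eps_dual):
    """Stop when both residuals fall below their tolerances."""
    return r_norm <= eps_pri and s_norm <= eps_dual
```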


5.1 Convergence Test 


Figure 3 shows how the dual residual varies with the number of iterations.
Assumption 1 [8] and Assumption 2 hold in the experiment. In some cases, the
workers that reached the master process on each iteration of HAD-ADMM were
basically the same as in AD-ADMM. Even under different conditions, the convergence
of the two algorithms is similar. This is consistent with the analysis in Sect. 4.1.
Therefore, the performance of HAD-ADMM is mainly related to $\mu$ from Eq. (10).


Fig. 3. The convergence of HAD-ADMM and AD-ADMM. 


5.2 Performance Test


It can be seen from Fig. 4(a) that the value of $\tau$ has a great influence on the running
time. There is a big difference between $\tau = 4$ and $\tau = 8$, which is characteristic of
the asynchronous algorithm. When $N_n = 2$ and $N_n = 4$, the effect of $\mu$ on the run time
is consistent with the analysis in Sect. 4.2. In the cluster used in this paper,
HAD-ADMM has a shorter running time if $\mu < 1/3$. When $N_n = 8$, $\mu$ has little effect.
Under this condition, the runtime of HAD-ADMM is much shorter than that of
AD-ADMM. The reason may be that, when the number of nodes is large, the
communication load of the master in AD-ADMM has a greater influence.


a. Fast Ethernet    b. Gigabit Ethernet


Fig. 4. The runtime of HAD-ADMM and AD-ADMM. 


Figure 4(b) shows the experiments on Gigabit Ethernet, with $\tau = 4$ and $A_n = 0.5M$.
It can be seen that HAD-ADMM is clearly better than AD-ADMM when $\mu < 1/5$.
Because $t_{inter} - t_{intra}$ is smaller on Gigabit Ethernet, a smaller $\mu$ is needed to
maintain $\Delta T > 0$.


6 Conclusion 


Targeting AD-ADMM on MPI, this paper proposes the HPS structure based on the
parameter server, which reduces inter-node communication through process
grouping and balances the load through scattered updates. In addition, this paper
proposes HAD-ADMM based on AD-ADMM, and analyzes its convergence and
performance experimentally. Experiments show that HAD-ADMM performs better in
large-scale distributed clusters. In the future, more attention will be paid to
applying HPS to other distributed algorithms.


Acknowledgements. This research was supported in part by Innovation Research program of 
Shanghai Municipal Education Commission under Grant 12ZZ094, and High-tech R&D Program 
of China under Grant 2009AA012201, and Shanghai Academic Leading Discipline Project 
J50103, and ZiQiang 4000 experimental environment of Shanghai University. 


Fast Communication Structure for Asynchronous Distributed ADMM 371 


References 


1. Chen, Q., Cao, F.: Distributed support vector machine in master-slave mode. Neural Netw.
   Off. J. Int. Neural Netw. Soc. 101, 94 (2018)
2. Taylor, G., Burmeister, R., Xu, Z., et al.: Training neural networks without gradients: a
   scalable ADMM approach. In: International Conference on Machine Learning,
   pp. 2722-2731. JMLR.org (2016)
3. Glowinski, R., Marrocco, A.: On the solution of a class of non linear Dirichlet problems by a
   penalty-duality method and finite elements of order one. In: Marchuk, G.I. (ed.)
   Optimization Techniques IFIP Technical Conference. LNCS. Springer, Heidelberg (1975).
   https://doi.org/10.1007/978-3-662-38527-2_45
4. Gabay, D., Mercier, B.: A dual algorithm for the solution of nonlinear variational problems
   via finite element approximation. Comput. Math Appl. 2(1), 17-40 (1976)
5. Boyd, S., Parikh, N., Chu, E., et al.: Distributed optimization and statistical learning via the
   alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1-122 (2010)
6. Lin, T., Ma, S., Zhang, S.: On the global linear convergence of the ADMM with multi-block
   variables. SIAM J. Optim. 25(3), 1478-1497 (2014)
7. Wang, Y., Yin, W., Zeng, J.: Global convergence of ADMM in nonconvex nonsmooth
   optimization. J. Sci. Comput., 1-35 (2018)
8. Zhang, R., Kwok, J.T.: Asynchronous distributed ADMM for consensus optimization. In:
   International Conference on Machine Learning, pp. I-1701. JMLR.org (2014)
9. Chang, T.H., Hong, M., Liao, W.C., et al.: Asynchronous distributed alternating direction
   method of multipliers: algorithm and convergence analysis. In: IEEE International
   Conference on Acoustics, Speech and Signal Processing, pp. 4781-4785. IEEE (2016)
10. Chang, T.H., Liao, W.C., Hong, M., et al.: Asynchronous distributed ADMM for large-scale
    optimization—Part II: linear convergence analysis and numerical performance. IEEE Trans.
    Signal Process. 64(12), 3131-3144 (2016)
11. Faraj, A., Patarasuk, P., Yuan, X.: A study of process arrival patterns for MPI collective
    operations. In: International Conference on Supercomputing, pp. 168-179. ACM (2007)
12. Patarasuk, P., Yuan, X.: Efficient MPI Bcast across different process arrival patterns. In:
    IEEE International Symposium on Parallel and Distributed Processing, pp. 1-11. IEEE
    (2009)
13. Qian, Y., Afsahi, A.: Process arrival pattern aware alltoall and allgather on InfiniBand
    clusters. Int. J. Parallel Program. 39(4), 473-493 (2011)
14. Tipparaju, V., Nieplocha, J., Panda, D.: Fast collective operations using shared and remote
    memory access protocols on clusters. In: International Parallel & Distributed Processing
    Symposium, p. 84a (2003)
15. Liu, Z.Q., Song, J.Q., Lu, F.S., et al.: Optimizing method for improving the performance of
    MPI broadcast under unbalanced process arrival patterns. J. Softw. 22(10), 2509-2522
    (2011)
16. Smola, A., Narayanamurthy, S.: An architecture for parallel topic models. VLDB Endow. 3,
    703-710 (2010)
17. Xing, E.P., Ho, Q., Dai, W., et al.: Petuum: a new platform for distributed machine learning
    on big data. In: ACM SIGKDD International Conference on Knowledge Discovery & Data
    Mining, pp. 1335-1344. IEEE (2015)
18. Li, M., Zhou, L., Yang, Z., Li, A., Xia, F.: Parameter server for distributed machine learning.
    In: Big Learning Workshop, pp. 1-10 (2013)



Improved Personalized Rankings Using 
Implicit Feedback 


Josef Feigl and Martin Bogdan


Department of Computer Engineering, University of Leipzig, 
Augustusplatz 10, 04109 Leipzig, Germany 
{feigl,bogdan}@informatik.uni-leipzig.de 


Abstract. Most users give feedback through a mixture of implicit and 
explicit information when interacting with websites. Recommender sys- 
tems should use both sources of information to improve personalized 
recommendations. In this paper, it is shown how to integrate implicit 
feedback information in form of pairwise item rankings into a neural 
network model to improve personalized item recommendations. The pro- 
posed two-sided approach allows the model to be trained even for users 
where no explicit feedback is available. This is especially useful to allevi- 
ate a form of the new user cold-start problem. The experiments indicate 
an improved predictive performance especially for the task of personal- 
ized ranking. 


Keywords: Personalized ranking · Neural networks ·
Collaborative filtering · Implicit feedback


1 Introduction 


Personalized feedback about user preferences is mostly limited to clicks, pur- 
chases or other forms of implicit information. It is rather uncommon that users 
give explicit feedback, for example in form of ratings. Recommender systems for 
both types of information are well covered in the collaborative filtering literature 
[7,10]. However, a more realistic problem is given when dealing with a mixture 
of both sources of information. This is especially interesting when information 
about most users is limited to implicit feedback. 

This paper builds on the results of [4] and [5] aiming to make use of both 
sources of information to improve the predictions of explicit user preferences. 
Therefore, our proposed neural network model integrates implicit feedback by 
learning additional user-specific pairwise item preferences, similar to the popular 
Bayesian Personalized Ranking criterion (BPR) [14]. Aside from the increased 
predictive performance of this approach, the model can thus also be trained for 
users where no explicit information is present. This is useful to ease a form of 
the common cold-start problem for new users. 

Therefore, the main contributions of this paper are (i) to show a novel way of
integrating implicit feedback in a recommender system using pairwise rankings,
(ii) to introduce mixed feedback datasets and show how to deal with them, and (iii)
to evaluate the impact of implicit feedback for personalized ranking. This paper
is structured as follows: A brief description of the general problem is given in
Sect. 2. Afterwards, in Sect. 3, we give an overview of the proposed neural network
architecture. In Sect. 4, we detail how to train the model. The proposed model
is evaluated in Sect. 5. We summarize our findings in Sect. 6.


2 Preliminaries 


Let $U = \{1, \dots, N\}$ be a set of users and $I = \{1, \dots, M\}$ a set of items with
$N, M \in \mathbb{N}$. The set of all ratings is given by $R = \{-1, 0, 1\}$, where the value 1 is
given if a user liked the item and the value 0 if the user did not like it. The value $-1$
highlights that no explicit information is available for this user-item tuple.

We have a dataset of observed interactions $S$ with

$$S = S^{expl} \cup S^{impl} \tag{1}$$

where

$$S^{expl} = \{(u, i, r) \mid u \in U, i \in I, r \in R\} \tag{2}$$

defines the set of all explicit feedback information and

$$S^{impl} = \{(u, i, -1) \mid u \in U, i \in I\} \tag{3}$$


defines the set of all implicit information. For each sample of $S^{impl}$, the user
interacted with an item in some way but did not explicitly assign it a rating.
Both datasets can easily be visualized by a table with three columns (see Table 1).
We call an item $i$ positive for user $u$ if this user had some kind of
interaction with the item. Let $I_u^+$ be the set of all positive items of user $u$. It is
defined as:

$$I_u^+ := \{i \mid (u, i, r) \in S\} \tag{4}$$

Therefore, all interactions with an explicit rating, even if the rating was negative,
are also considered as positive feedback. Analogous to the definition above, we
use $I_u^-$ for the set of all negative items, i.e. all items user $u$ had no interaction
with [5].
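The definition of positive and negative items can be illustrated with the training rows of Table 1. The following is only a minimal sketch: the function names and the four-item catalogue `ALL_ITEMS` are assumptions made for this toy example and are not part of the original model code.

```python
from collections import defaultdict

# Training rows of Table 1 as (user, item, rating); -1 marks implicit feedback.
S = [(1, 1, 0), (1, 2, 1), (1, 3, -1), (2, 1, -1), (2, 4, -1)]
ALL_ITEMS = {1, 2, 3, 4}  # assumed item catalogue I for this toy example


def positive_items(samples):
    """Map each user u to I_u^+: every item the user interacted with."""
    pos = defaultdict(set)
    for user, item, _rating in samples:
        pos[user].add(item)  # the rating value is ignored on purpose
    return dict(pos)


def negative_items(pos, all_items):
    """Map each user u to I_u^-: all items without any interaction."""
    return {user: all_items - items for user, items in pos.items()}


POS = positive_items(S)
NEG = negative_items(POS, ALL_ITEMS)
```

Note that item 1 counts as positive for user 1 although its explicit rating is 0, which mirrors the remark that negatively rated items are still treated as positive feedback.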


Table 1. Training data (left): The rating value of -1 highlights that no explicit
information was available. User 1 has explicit as well as implicit information in his training
data. User 2 has only implicit data. Test data (right): Explicit ratings have to be
predicted for both users.

Training data            Test data
User  Item  Rating       User  Item  Rating
1     1      0           1     4     0
1     2      1           2     2     1
1     3     -1           2     3     0
2     1     -1
2     4     -1


3 Model


3.1 Main Idea 


Common matrix factorization models learn a set of latent user and latent item 
factors to predict a target. Our model learns an additional set of item factors: 
one for the explicit and one for the implicit information in the dataset. The final 
target prediction of our model is a weighted average of two separate predictions: 
one using the user factors in combination with the explicit item factors and one 
using the user factors in combination with the implicit item factors. 

Therefore, our model consists of one part to train the explicit item factors
and one part to train the implicit item factors. However, both parts share the
same user factors. Each part updates its own item factors, but both
parts use and update the same user factors.

While the explicit item factors are updated using all available explicit feedback
information (similar to most matrix factorization models), the implicit
item factors are trained to rank positive and negative items. This is similar to a
matrix factorization model using the BPR criterion (BPRMF) [14]. Therefore,
our model is a combination of a biased regularized matrix factorization model
(BRMF) [4] and a BPRMF model.


3.2 Model Overview 


Our proposed network consists of two parts: one part to process the explicit
feedback and one for the implicit feedback (see Fig. 1). The network is a
concatenation of five specific layers $L$: A user layer $L^1$ with $N$ units. This layer has
as many units as there are users and is responsible for learning the user
representations. The next layer is the hidden layer $L^2$ with $K$ units, which determines
the size of all learned representations. The following item layer $L^3$ holds the
explicit and implicit item representations. It has $2 \cdot M$ units. The second to last
layer is the bias layer $L^4$, which is responsible for dealing with user, item and
global biases. The last layer $L^5$ is a combination layer, which merges the outputs
of the explicit as well as the implicit part of the network.


3.3 Notation 


The following short notations are used in this paper: let $U$ be the set of weights
connecting the user layer to the hidden layer. It can be represented as a weight
matrix $U \in \mathbb{R}^{N \times K}$. A single representation of user $u$ is given by the weights
connecting unit $u$ of the user layer with all units of the hidden layer. We use the
notation $U_u$ for this single user representation [4].

Let $I^{expl} \in \mathbb{R}^{K \times M}$ be the set of weights connecting the hidden layer to
the explicit item layer. Analogous to the user layer, we use $I_i^{expl}$ to define the
explicit representation for the item $i$. Similar notations are used for the implicit
item weights $I^{impl} \in \mathbb{R}^{K \times M}$. Additionally, $a^l$ defines the activation of layer $l$ [5].

[Figure 1: network diagram with the columns User Layer, Hidden Layer, Item Layer, Bias Layer and Combination Layer]

Fig. 1. The upper part of the network leading to $\hat{r}_{ui}$ handles the prediction of explicit
ratings. The final prediction of this part is a weighted average of one prediction using
the implicit item factors (green) and one using the explicit item factors (red). The
lower part leading to $\hat{x}_{uij}$ updates the implicit item factors by learning to rank
user-specific positive and negative items. The letter b symbolizes the addition of biases to
the activation of the item layer. (Color figure online)


We use $r_{ui}$ as a short notation for the rating $r$ given by user $u$ for item $i$.
The prediction $\hat{r}_{ui}$ for this rating is made by the explicit part of our model. The
implicit part of the network measures the difference between the preferences of
a positive and a negative item of user $u$. We use $\hat{x}_{ui}$ as the measure of preference
to determine how much user $u$ likes item $i$. Therefore, user $u$ prefers item $i$ over
item $j$ if $\hat{x}_{ui} > \hat{x}_{uj}$. The output of the implicit part of the network $\hat{x}_{uij}$ is given
by the probability that user $u$ prefers item $i$ over item $j$ [5].


4 Training


We use a two-sided approach to train the network. All explicit samples are used 
to train the explicit part of the network. The implicit part of the network is 
trained by learning to rank positive and negative items. To do this, we need two 
separate training sets. 


4.1 Preparation of Training Sets 


To train the network, we need training samples for the explicit as well as the
implicit part of the model (see Fig. 2). All samples for the explicit part are given
in $S^{expl}$. For consistency, we use the notation $T^{expl} := S^{expl}$ for this set.

Explicit training set    Full training data       Implicit training set
User  Item  Rating       User  Item  Rating       User  Positive Item  Negative Item
1     1     0            1     1      0           1     1              4
1     2     1            1     2      1           1     3              4
                         1     3     -1           2     1              2
                         2     1     -1           2     4              3
                         2     4     -1

Fig. 2. Starting from the full training data (middle table), we create explicit (left table)
and implicit training sets (right table). The explicit training set consists of all available
explicit samples. The implicit training set is sampled from user-specific positive and
negative items.

The training samples for the implicit part $T^{impl}$ are created using the
following process: to create a set of $p$ training samples, we choose a uniformly
randomly selected set of $p$ users $u \in U$ (with replacement). For each user, one
of his positive items $i \in I_u^+$ and one of his negative items $j \in I_u^-$ is randomly
selected (uniformly distributed with replacement):

$$T^{impl} := \{(u, i, j) \mid u \in U, i \in I_u^+, j \in I_u^-\} \tag{5}$$

A model should therefore learn to rank $\hat{x}_{ui}$ above $\hat{x}_{uj}$ for each training triple
$(u, i, j) \in T^{impl}$.
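The sampling process for the implicit training set can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name, the `rng` argument and the decision to skip users without positive or negative items are assumptions.

```python
import random


def sample_implicit_triples(pos, neg, p, rng):
    """Draw p triples (u, i, j) with i from I_u^+ and j from I_u^-."""
    # Only users with at least one positive and one negative item can be used.
    users = sorted(u for u in pos if pos[u] and neg.get(u))
    triples = []
    for _ in range(p):
        u = rng.choice(users)           # users are drawn with replacement
        i = rng.choice(sorted(pos[u]))  # uniformly chosen positive item
        j = rng.choice(sorted(neg[u]))  # uniformly chosen negative item
        triples.append((u, i, j))
    return triples


# Item sets of the two users in Table 1.
POS = {1: {1, 2, 3}, 2: {1, 4}}
NEG = {1: {4}, 2: {2, 3}}
T_IMPL = sample_implicit_triples(POS, NEG, p=6, rng=random.Random(0))
```

Every returned triple pairs a user with one interacted and one non-interacted item, which is the form shown in the right table of Fig. 2.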


4.2 Explicit Part 


Let $(u, i, r) \in T^{expl}$ be a single training triple for the explicit part of the network,
where $u \in U$ is a user, $i \in I$ is an item and $r \in \{0, 1\}$ is a rating.

A binarized version $a^1 = 1_u \in \{0, 1\}^N$ of $u$ is used as the input for the
network. It is defined as the indicator vector $1_u := (z_0, z_1, \dots, z_N)$ with $z_j = 1$
if $j = u$ and $z_j = 0$ otherwise. Using $1_u$ as input for the network implies that
only the weights $U_u$ contribute anything to the input of the hidden layer [4].
The output $a^2 \in \mathbb{R}^K$ is therefore given by:

$$a^2 = U^\top \cdot a^1 = U_u. \tag{6}$$

We select the implicit and explicit weights for item $i$ to compute the output
of the item layer $a^3$:

$$a^3_{expl} = a^2 \cdot I_i^{expl} \tag{7}$$
$$a^3_{impl} = a^2 \cdot I_i^{impl} \tag{8}$$
In our evaluation, we found no benefit from using anything other than identity
activation functions for the hidden and item layer. We are therefore omitting the
notation of activation functions for these two layers.

The following bias layer is responsible for adding a user bias, an explicit item
bias and a global bias to the previous output. The output of the explicit part of
our network $\hat{r}_{ui}$ is therefore given by:

$$\hat{r}_{ui} = \sigma(w_1 \cdot p^{expl}_{ui} + w_2 \cdot p^{impl}_{ui}) \tag{9}$$

with

$$p^{expl}_{ui} = f(a^3_{expl} + b_u + b_i^{expl} + b_g), \tag{10}$$
$$p^{impl}_{ui} = f(a^3_{impl} + b_u + b_i^{impl} + b_g). \tag{11}$$

The function $f : \mathbb{R} \to \mathbb{R}$ is the activation function of the bias layer. For the
combination layer, we use the logistic sigmoid activation function $\sigma$ to get the
probability estimate that user $u$ likes item $i$.

After forward-propagating, we compare the prediction $\hat{r}_{ui}$ with the target
$r_{ui}$ and back-propagate the loss $r_{ui} - \hat{r}_{ui}$ using the common cross entropy cost
function [2,15]. We are updating all weights except the implicit item weights
and the implicit item bias, which get updated during the training of the implicit
part of the network.

We achieved our best results using the weights $w_1 = 0.5$ and $w_2 = 0.5$ instead
of letting the network learn them. This way the network is forced to use both
parts of the network equally. Using $w_1 = 1$ and $w_2 = 0$ disables the implicit part
and reduces our model to a BRMF model [4,17].
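The forward pass of the explicit part can be sketched in a few lines. This is a hedged illustration only: the function and argument names are hypothetical, the factor values are made up, and the bias-layer activation `f` defaults to the identity for brevity, whereas the experiments use a SELU activation (Sect. 5.2).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_rating(user_vec, item_expl, item_impl, b_u, b_i_expl, b_i_impl, b_g,
                   w1=0.5, w2=0.5, f=lambda z: z):
    """Sigmoid of the weighted sum of the explicit and implicit predictions.

    f is the bias-layer activation; the identity default is an assumption
    made for brevity and is not the setting used in the experiments.
    """
    p_expl = f(dot(user_vec, item_expl) + b_u + b_i_expl + b_g)
    p_impl = f(dot(user_vec, item_impl) + b_u + b_i_impl + b_g)
    return sigmoid(w1 * p_expl + w2 * p_impl)


# Hypothetical factors with K = 2 and all biases set to zero.
U_u = [0.2, -0.1]
I_expl = [0.5, 0.3]
I_impl = [0.1, 0.4]
both = predict_rating(U_u, I_expl, I_impl, 0.0, 0.0, 0.0, 0.0)
brmf = predict_rating(U_u, I_expl, I_impl, 0.0, 0.0, 0.0, 0.0, w1=1.0, w2=0.0)
```

With `w1=1.0` and `w2=0.0` the implicit item factors no longer influence the output, which corresponds to the reduction to a BRMF model described above.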


4.3 Implicit Part 


Let $(u, i, j) \in T^{impl}$ be a single training triple for the implicit part, where $u \in U$
is a user, $i \in I_u^+$ is a positive item and $j \in I_u^-$ is a negative item for this user.
Similarly to the training of the explicit part of our network, feed-forwarding this
sample through the implicit part of the network yields:

$$\hat{x}_{uij} = \sigma(\hat{x}^{impl}_{ui} - \hat{x}^{impl}_{uj}) \tag{12}$$

with

$$\hat{x}^{impl}_{ui} = f(U_u \cdot I_i^{impl} + b_u + b_i^{impl} + b_g), \tag{13}$$
$$\hat{x}^{impl}_{uj} = f(U_u \cdot I_j^{impl} + b_u + b_j^{impl} + b_g). \tag{14}$$
Again, we use the logistic sigmoid activation function $\sigma$ to get the probability
estimate that user $u$ prefers item $i$ over item $j$.

The training samples $T^{impl}$ are missing target values $y$ in the classical
machine learning sense, but our training set is constructed in such a way that
$\hat{x}^{impl}_{ui} > \hat{x}^{impl}_{uj}$ for each sample (Subsect. 4.1). This means that the measure of
preference of a positive item is always greater than this measure for a negative
item. Learning to maximize the probability $\hat{x}_{uij}$ is sufficient to achieve this goal
and we can therefore set $y = 1$ for every training sample (also see [5]). Again,
using the cross entropy cost function, we back-propagate the loss $1 - \hat{x}_{uij}$ and
update the user weights, the implicit item weights and the implicit item bias.

The constant weights of the combination layer force the implicit part of our 
network to learn pairwise item rankings. This part of the model is therefore 
equivalent to a BPRMF model [5,14]. 
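Because the target is fixed to y = 1, the cross entropy of the implicit part reduces to the negative logarithm of the pairwise probability. The following sketch illustrates this with hypothetical preference scores; the function names are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pairwise_probability(x_ui, x_uj):
    """Probability that user u prefers item i over item j."""
    return sigmoid(x_ui - x_uj)


def pairwise_cross_entropy(x_ui, x_uj):
    """Cross entropy with the constant target y = 1, i.e. -log of the probability."""
    return -math.log(pairwise_probability(x_ui, x_uj))


# Hypothetical preference scores for one (u, i, j) triple.
well_ranked = pairwise_cross_entropy(2.0, -1.0)   # positive item scored higher
badly_ranked = pairwise_cross_entropy(-1.0, 2.0)  # positive item scored lower
```

The loss shrinks as the positive item is scored further above the negative item, so minimizing it pushes the model towards the desired ranking.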


4.4 Mini-Batch-Processing 


For each training epoch, we have a set of $|T^{expl}|$ samples for the explicit part and
a set of $|T^{impl}|$ samples for the implicit part of our model. Instead of processing a
single sample at a time, we split each set into mini-batches of $P$ samples. During
each training epoch, we process all available mini-batches in a random order,
which helps to improve convergence of both parts of the network. An epoch is
finished once all mini-batches were processed. We create a new set of training
samples $T^{impl}$ for each training epoch.

Using the set of negative items $I_u^-$ to create $T^{impl}$ can be memory-consuming
and computationally slow. Since most users interact only with a small percentage
of all items, we found it to be sufficient to sample the negative item from all possible items
instead of using $I_u^-$. We found no significant loss of predictive performance using
this approximate approach.
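One epoch of this procedure might be organized as in the sketch below. It is an assumption-laden illustration: the helper names are hypothetical, and interleaving the explicit and implicit mini-batches in a single shuffled list is one possible reading of "random order", not a detail confirmed by the text.

```python
import random


def sample_triple_approx(pos, all_items, rng):
    """Draw (u, i, j) where j comes from all items instead of I_u^-."""
    u = rng.choice(sorted(pos))
    i = rng.choice(sorted(pos[u]))
    j = rng.choice(all_items)  # may occasionally hit a positive item
    return (u, i, j)


def shuffled_mini_batches(samples, batch_size, rng):
    """Split samples into mini-batches and return them in random order."""
    batches = [samples[k:k + batch_size]
               for k in range(0, len(samples), batch_size)]
    rng.shuffle(batches)
    return batches


rng = random.Random(0)
POS = {1: {1, 2, 3}, 2: {1, 4}}
ITEMS = [1, 2, 3, 4]
T_EXPL = [(1, 1, 0), (1, 2, 1)]
# A fresh implicit training set is drawn for every epoch.
T_IMPL = [sample_triple_approx(POS, ITEMS, rng) for _ in range(5)]
BATCHES = shuffled_mini_batches(T_EXPL, 2, rng) + shuffled_mini_batches(T_IMPL, 2, rng)
rng.shuffle(BATCHES)  # interleave both kinds of batches (an assumption)
```

Sampling the negative item from the full catalogue avoids materializing the negative item set for every user, at the price of a small fraction of mislabeled pairs.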


5 Experiments and Results 


5.1 Setting 


The MovieLens 1M dataset [6] and the Netflix Prize dataset [1] are used to 
evaluate our model. Since both datasets contain explicit movie ratings in the 
range [1,5], we convert these ratings to binary targets by checking if the rating 
is above or equal to 3. 

To simulate the situation where users have provided only little or even no
explicit feedback information, we create multiple mixed variants of these two
datasets. The following process was used to create all benchmark datasets: at
first, a given percentage s of all explicit ratings are dropped. Afterwards, all
explicit ratings of t percent of all users are dropped. This way, t percent of all
users have only implicit information left and the remaining users lose about s percent
of their provided explicit information. We use the short notation ML(s, t)
and Netflix(s, t) to denote all benchmark datasets, which were created using
the explained process on the MovieLens 1M and Netflix Prize dataset, respectively.
Using this notation, ML(0,0) and Netflix(0,0) simply refer to the full
datasets.
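The creation of a mixed benchmark dataset can be sketched as follows. This is a hedged sketch under two assumptions: a dropped explicit rating is kept as an implicit interaction with the value -1, and s and t are passed as fractions between 0 and 1. The function names and the toy ratings are hypothetical.

```python
import random


def binarize(rating):
    """Map a 1-5 star rating to a binary target (1 if rating >= 3)."""
    return 1 if rating >= 3 else 0


def make_mixed(samples, s, t, rng):
    """Create a mixed variant: hide explicit ratings by setting them to -1.

    s: fraction of all explicit ratings to hide,
    t: fraction of users whose remaining explicit ratings are all hidden.
    """
    rows = [(u, i, binarize(r)) for u, i, r in samples]
    hidden = set(rng.sample(range(len(rows)), round(s * len(rows))))
    rows = [(u, i, -1) if k in hidden else (u, i, r)
            for k, (u, i, r) in enumerate(rows)]
    users = sorted({u for u, _, _ in rows})
    implicit_only = set(rng.sample(users, round(t * len(users))))
    return [(u, i, -1) if u in implicit_only else (u, i, r) for u, i, r in rows]


# Hypothetical star ratings: 4 users with 5 rated items each.
RATINGS = [(u, i, 1 + (u + i) % 5) for u in range(1, 5) for i in range(1, 6)]
MIXED = make_mixed(RATINGS, s=0.5, t=0.5, rng=random.Random(0))
```

In this toy run, half of the users end up with implicit information only, while the other users keep roughly half of their explicit ratings.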

For the MovieLens 1M dataset, we used a 5-fold cross-validation. The Netflix
Prize dataset comes with a predefined probe dataset, which we use as test set to
validate all predictions. To speed up computation, we randomly selected 10000
out of 480189 users of the Netflix Prize dataset in each run. The process to
create the benchmark datasets was applied on the training data of each run. The
test data was left untouched since we want to benchmark our model predicting
explicit ratings even if there was no explicit information in the training data of
a user. We did a total of five runs for each dataset and averaged the results.

Our model is compared against two popular baseline models:

BRMF. A biased regularized matrix factorization model, which is implemented
  using the explicit part of our model (see Sect. 4.2). This model is especially useful
  as a fair comparison with our full network to directly evaluate the impact of
  the integration of implicit information.

FM. A factorization machine was used as the second baseline model [12]. The
  results for this model were computed using the open-source library libFM
  [13].

We are using three metrics to evaluate the model performance: The Area
Under the Receiver Operating Characteristic Curve (AUC) to measure the ranking
quality [3], logistic loss (LogLoss) and Accuracy to measure the general
predictive performance.
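For reference, the three metrics can be computed from binary targets and predicted probabilities as in the following sketch. It uses the pairwise definition of AUC and a decision threshold of 0.5 for Accuracy; the threshold and all function names are assumptions, since the evaluation code is not part of the text.

```python
import math


def accuracy(targets, probs, threshold=0.5):
    """Share of predictions on the correct side of the threshold."""
    hits = sum((p >= threshold) == bool(t) for t, p in zip(targets, probs))
    return hits / len(targets)


def log_loss(targets, probs, eps=1e-15):
    """Mean binary cross entropy between targets and probabilities."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


def auc(targets, probs):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for t, p in zip(targets, probs) if t == 1]
    neg = [p for t, p in zip(targets, probs) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.4, 0.35, 0.2]
```

AUC only depends on the ordering of the predictions, whereas LogLoss also penalizes badly calibrated probabilities, which is why the two metrics can rank models differently.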


5.2 Network Initialization Details 


The user weights and the explicit and implicit item weights are initialized with
uniformly distributed random numbers from the range [-0.01, 0.01]. We are
using a SELU activation function in the bias layer [9] and two Adam optimizers
[8]: one for the explicit and one for the implicit part of the network. To regularize
the network, we use L2 [11] and max-norm regularization [16] for all weights.


5.3 Results 


The evaluation results of all models for the MovieLens 1M datasets can be found
in Table 2 and for the Netflix Prize datasets in Table 3.

Our model achieves a significantly improved predictive performance compared
to the BRMF model on all metrics and on all datasets. This is especially
interesting since both models share many similarities, with the only difference
being the integration of implicit feedback using pairwise item rankings.

Table 2. Evaluation results for the MovieLens 1M dataset

Metric   | Model     | ML(0, 0) | ML(0.5, 0.25) | ML(0.5, 0.5) | ML(0.5, 0.75)
AUC      | BRMF      | 0.8216   | 0.7830        | 0.7633       | 0.7382
         | FM        | 0.8248   | 0.7877        | 0.7668       | 0.7410
         | Our Model | 0.8249   | 0.7901        | 0.7709       | 0.7455
LogLoss  | BRMF      | 0.5196   | 0.5574        | 0.5737       | 0.5929
         | FM        | 0.5032   | 0.5438        | 0.5642       | 0.5885
         | Our Model | 0.5119   | 0.5466        | 0.5635       | 0.5860
Accuracy | BRMF      | 0.7470   | 0.7173        | 0.7032       | 0.6855
         | FM        | 0.7504   | 0.7217        | 0.7067       | 0.6890
         | Our Model | 0.7512   | 0.7233        | 0.7089       | 0.6918

Table 3. Evaluation results for the Netflix Prize dataset

Metric   | Model     | Netflix(0, 0) | Netflix(0.5, 0.25) | Netflix(0.5, 0.5) | Netflix(0.5, 0.75)
AUC      | BRMF      | 0.7844        | 0.7446             | 0.7181            | 0.6868
         | FM        | 0.7879        | 0.7482             | 0.7211            | 0.6889
         | Our Model | 0.7871        | 0.7486             | 0.7237            | 0.6969
LogLoss  | BRMF      | 0.5443        | 0.5812             | 0.6001            | 0.6194
         | FM        | 0.5373        | 0.5751             | 0.5955            | 0.6171
         | Our Model | 0.5416        | 0.5784             | 0.5975            | 0.6151
Accuracy | BRMF      | 0.7221        | 0.6906             | 0.6733            | 0.6513
         | FM        | 0.7279        | 0.6942             | 0.6733            | 0.6521
         | Our Model | 0.7255        | 0.6950             | 0.6756            | 0.6583

It can also be seen that the FM model performs significantly better than
the BRMF model. This is no surprise, since the FM model can easily mimic
most matrix factorization models [12].

Our model performs better than or at least as well as the FM model on the AUC
and Accuracy metrics on all datasets except the full Netflix dataset, where the FM
model is slightly ahead (Table 3). The difference between both models
also gets larger the more of the explicit information is dropped from the dataset.
This is to be expected, because our model can still use the remaining implicit
information. It can also be seen that integrating implicit information in the form of
pairwise item rankings is especially beneficial for the AUC metric. This is due
to the fact that the implicit part of our model is basically a matrix factorization
model using the BPR criterion, which is well suited to optimize AUC [12].

The FM model performs noticeably better than our model regarding the
LogLoss metric on both full datasets. Nevertheless, integrating the implicit
information helps to close this gap and enables our model to perform even better
than the FM model regarding the LogLoss metric on the sparser mixed datasets.


6 Summary 


In this paper, we have proposed a neural network recommender system to solve 
collaborative filtering problems where users give feedback through a mixture 
of implicit and explicit information and in particular the case where all infor- 
mation about most users is limited to implicit feedback. Our model integrates 
implicit information by additionally learning personalized item rankings using 
the Bayesian Personalized Ranking criterion. These features are further used 
to influence the processing of the explicit information. This two-sided approach
enables the model to be trained for users that never gave any explicit feedback,
which is useful to improve recommendations and alleviate the cold start problem
for new users. It was shown that integrating implicit feedback using our proposed
approach leads to an increase of predictive performance especially for the task
of personalized ranking.


Abstract. Traditionally, multi-layer neural networks use the dot product between
the output vector of the previous layer and the incoming weight vector as the input
to the activation function. The result of the dot product is unbounded, which increases the
risk of large variance. Large variance of a neuron makes the model sensitive to
changes of the input distribution, which results in poor generalization and aggravates
the internal covariate shift that slows down training. To bound the dot product
and decrease the variance, we propose to use cosine similarity or centered cosine
similarity (Pearson Correlation Coefficient) instead of the dot product in neural
networks, which we call cosine normalization. We compare cosine normalization
with batch, weight and layer normalization in fully-connected neural networks
and convolutional networks on the data sets MNIST, 20NEWS GROUP,
CIFAR-10/100 and SVHN. Experiments show that cosine normalization achieves
better performance than other normalization techniques.


Keywords: Neural networks · Cosine similarity · Cosine normalization


1 Introduction 


Deep neural networks have achieved great success in recent years in many areas.
Training deep neural networks is a nontrivial task. Gradient descent is commonly used to
train neural networks. However, due to the vanishing gradient problem [1], it works badly
when applied directly to deep networks.

In previous work, multi-layer neural networks use the dot product (also called inner
product) between the output vector of the previous layer and the incoming weight vector as
the input to the activation function:

$$net = w \cdot x \tag{1}$$

where net is the input to the activation function (pre-activation), $w$ is the incoming weight
vector, $x$ is the input vector, which is also the output vector of the previous layer, and $\cdot$
indicates the dot product. Equation 1 can be rewritten as Eq. 2, where $\cos\theta$ is the cosine of the
angle between $w$ and $x$, and $|\cdot|$ is the Euclidean norm of a vector.

$$net = |w| |x| \cos\theta \tag{2}$$


The result of the dot product is unbounded, which increases the risk of large variance.
Large variance of a neuron makes the model sensitive to changes of the input distribution
and thus results in poor generalization. Large variance could also aggravate the internal
covariate shift, which slows down training [2]. Using small weights can alleviate
this problem. Weight decay (L2-norm) [3] and max normalization (max-norm) [4, 5]
are methods that can decrease the weights. Batch normalization [2] uses statistics
calculated from mini-batch training examples to normalize the result of the dot product,
while layer normalization [6] uses statistics from the same layer on a single training
case. The variance can be constrained within a certain range using batch or layer
normalization. Weight normalization [7] re-parameterizes the weight vector by dividing it by its
norm and thus partially bounds the result of the dot product.

To thoroughly bound the dot product, a straightforward idea is to use cosine similarity.
Similarity (or distance) based methods are widely used in data mining and machine
learning [8]. In particular, cosine similarity is most commonly used in high dimensional
spaces. For example, in information retrieval and text mining, cosine similarity gives a
useful measure of how similar two documents are [9].

In this paper, we combine cosine similarity with neural networks. We use cosine
similarity instead of the dot product when computing the pre-activation. This can be seen
as a normalization procedure, which we call cosine normalization. Equation 3 shows
the cosine normalization.


$$net_{norm} = \cos\theta = \frac{w \cdot x}{|w| |x|} \tag{3}$$

As an extension, we can use the centered cosine similarity, the Pearson Correlation
Coefficient (PCC), instead of the dot product. By dividing by the magnitudes of $w$ and $x$, the input to
the activation function is bounded between -1 and 1. A higher learning rate can be used
for training without the risk of large variance. Moreover, a network with cosine
normalization can be trained by both batch gradient descent and stochastic gradient
descent, since it does not depend on any statistics on batch or mini-batch examples.
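The difference between the plain dot product and the two normalized pre-activations can be seen in a small sketch. The function names and the example vectors are hypothetical; the centered variant follows the standard definition of the Pearson Correlation Coefficient.

```python
import math


def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def norm(v):
    return math.sqrt(dot(v, v))


def cosine_preactivation(w, x):
    """Cosine of the angle between w and x (both must be non-zero)."""
    return dot(w, x) / (norm(w) * norm(x))


def pcc_preactivation(w, x):
    """Centered cosine similarity: subtract each vector's mean first."""
    mu_w = sum(w) / len(w)
    mu_x = sum(x) / len(x)
    return cosine_preactivation([a - mu_w for a in w], [b - mu_x for b in x])


w = [3.0, -1.0, 2.0]
x = [100.0, 50.0, -20.0]
plain = dot(w, x)                   # unbounded, grows with |w| and |x|
cos_n = cosine_preactivation(w, x)  # always in [-1, 1]
pcc_n = pcc_preactivation(w, x)     # always in [-1, 1]
```

Scaling `w` or `x` by any positive factor changes `plain` but leaves `cos_n` and `pcc_n` untouched, which is the bounding effect the text relies on.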

We compare our cosine normalization with batch, weight and layer normalization 
in fully-connected neural networks on the MNIST and 20NEWS GROUP data sets. 
Additionally, convolutional networks with different normalization techniques are 
evaluated on the CIFAR-10/100 and SVHN data sets. Experiments show that cosine 
normalization and centered cosine normalization (PCC) achieve better performance 
than other normalization techniques. 


2 Background and Motivation 


Large variance of a neuron in a neural network makes the model sensitive to changes of the
input distribution and thus results in poor generalization. Moreover, variance can be
amplified as information moves forward through the layers, especially in deep networks. Large
variance could also aggravate the internal covariate shift, which refers to the change of the
distribution of each layer during training, as the parameters of previous layers change [2].
Internal covariate shift slows down the training because the layers need to continuously
adapt to the new distribution. Traditionally, neural networks use the dot product to compute
the pre-activation of a neuron. The result of the dot product is unbounded. That is to say, the
result could be any real value, which increases the risk of large variance.

Using small weights can alleviate this problem, since the pre-activation net in Eq. 2
will be decreased when |w| is small. Weight decay [3] and max normalization [4, 5] are
methods that try to keep the weights small. Weight decay adds an extra term to the
cost function that penalizes the squared value of each weight separately. Max normalization
puts a constraint on the maximum squared length of the incoming weight vector of
each neuron. If an update violates this constraint, max normalization scales down the vector
of incoming weights to the allowed length. The objective (or direction to objective) of the
original optimization problem is changed when using weight decay (or max normalization).
Moreover, they bring additional hyperparameters that should be carefully preset.
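Both techniques are easy to state in code. The sketch below is illustrative only: the names `max_length` and `lam` stand for the extra hyperparameters mentioned above and are assumptions, and the penalty is written without the factor 1/2 that some formulations include.

```python
import math


def max_norm_project(w, max_length):
    """Scale w down to max_length if its Euclidean norm exceeds it."""
    length = math.sqrt(sum(a * a for a in w))
    if length <= max_length:
        return list(w)  # constraint satisfied, nothing to do
    scale = max_length / length
    return [a * scale for a in w]


def weight_decay_penalty(w, lam):
    """Extra cost term: lam times the sum of squared weights."""
    return lam * sum(a * a for a in w)


w_after_update = [3.0, 4.0]                          # Euclidean norm 5
w_projected = max_norm_project(w_after_update, 2.0)  # rescaled to norm 2
penalty = weight_decay_penalty(w_after_update, 0.01)
```

The projection keeps the direction of the weight vector and only shortens it, while weight decay changes the cost function itself.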

Batch normalization [2] uses statistics calculated from mini-batch training examples
to normalize the pre-activation. The normalized value is re-scaled and re-shifted using
additional parameters. Since batch normalization uses the statistics on mini-batch
examples, its effect is dependent on the mini-batch size. To overcome this problem,
normalization propagation [10] uses a data-independent parametric estimate of the mean
and standard deviation, while layer normalization [6] computes the mean and standard
deviation from the same layer on a single training case. Weight normalization [7]
re-parameterizes the incoming weight vector by dividing it by its norm. It decouples the length
of the weight vector from its direction and thus partially bounds the result of the dot product. But
it does not consider the length of the input vector. These methods all bring additional
parameters to be learned and thus make the model more complex.

An important source of inspiration for our work is cosine similarity, which is
widely used in data mining and machine learning [8, 9]. To thoroughly bound the dot
product, a straightforward idea is to use cosine similarity. We combine cosine similarity
with neural networks; the details are described in the next section.


3 Methods 


3.1 Cosine Normalization 


To decrease the variance of a neuron, we propose a new method, called cosine
normalization, which simply uses cosine similarity instead of the dot product in neural
networks. Cosine normalization bounds the pre-activation between -1 and 1. The result
could be even smaller when the dimension is high. As a result, the variance can be
controlled within a very narrow range. A simple multi-layer neural network is shown in
Fig. 1. Using cosine normalization, the output of a hidden unit is computed by Eq. 4,
where $net_{norm}$ is the normalized pre-activation, $w$ is the incoming weight vector, $x$ is
the input vector and $f$ is a nonlinear activation function.

$$o = f(net_{norm}) = f(\cos\theta) = f\left(\frac{w \cdot x}{|w| |x|}\right) \tag{4}$$

Fig. 1. A simple neural network with cosine normalization. The output of a hidden unit is the
nonlinear transform of the cosine similarity between the input vector and the incoming weight vector.


We use gradient descent (back propagation) to train the neural network with cosine
normalization. Compared to batch normalization, cosine normalization does not
depend on any statistics on batch or mini-batch examples, so the model can be trained
by both batch gradient descent and stochastic gradient descent. The procedure of back
propagation in a neural network with cosine normalization is the same as in an ordinary neural
network except for the derivative of $net_{norm}$ with respect to $w$ or $x$.

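To show the derivative conveniently, the cosine normalization can be rewritten as
Eq. 5. Then, the derivative of $net_{norm}$ with respect to $w_i$ or $x_i$ can be calculated by Eq. 6
or Eq. 7.

$$net_{norm} = \cos\theta = \frac{\sum_j (w_j x_j)}{\sqrt{\sum_j w_j^2} \sqrt{\sum_j x_j^2}} \tag{5}$$

$$\frac{\partial net_{norm}}{\partial w_i} = \frac{x_i}{\sqrt{\sum_j w_j^2} \sqrt{\sum_j x_j^2}} - \frac{w_i \sum_j (w_j x_j)}{\left(\sum_j w_j^2\right)^{3/2} \sqrt{\sum_j x_j^2}} \tag{6}$$

$$\frac{\partial net_{norm}}{\partial x_i} = \frac{w_i}{\sqrt{\sum_j w_j^2} \sqrt{\sum_j x_j^2}} - \frac{x_i \sum_j (w_j x_j)}{\sqrt{\sum_j w_j^2} \left(\sum_j x_j^2\right)^{3/2}} \tag{7}$$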

As pointed out in [11], centering the inputs of units can help the training of neural
networks. Batch or layer normalization centers the data by subtracting the mean of the
batch or layer, while mean-only batch normalization can enhance the performance of
weight normalization [7]. We can use the Pearson Correlation Coefficient (PCC), which is the
centered cosine similarity as shown in Eq. 8, to extend cosine normalization, where $\mu_w$ is
the mean of $w$ and $\mu_x$ is the mean of $x$.

$$net_{norm} = \frac{(w - \mu_w) \cdot (x - \mu_x)}{|w - \mu_w| |x - \mu_x|} \tag{8}$$


3.2 Implementation


When implementing cosine normalization in fully-connected nets, we just need to
divide by the norm of the incoming weight vector as well as the norm of the input vector. The
input vector is the output vector of the previous layer. That is to say, the hidden units in the
same layer share the same input-vector norm. In convolutional nets, the
input vector is constrained to a receptive field. Different receptive fields have different
input norms but the same incoming weight norm, since different receptive fields share
the same weights.
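The two cases can be sketched with plain Python lists. This is a simplified illustration, not the implementation used in the experiments: the function names are hypothetical and the convolution is one-dimensional without padding or stride.

```python
import math


def l2(v):
    return math.sqrt(sum(a * a for a in v))


def cosine_dense(x, weights):
    """One cosine pre-activation per hidden unit (one weight row per unit)."""
    x_norm = l2(x)  # computed once: all units of the layer see the same input
    return [sum(a * b for a, b in zip(w, x)) / (l2(w) * x_norm) for w in weights]


def cosine_conv1d(signal, kernel):
    """1-D 'valid' convolution (cross-correlation form) with cosine normalization."""
    k_norm = l2(kernel)  # computed once: the kernel is shared by all fields
    out = []
    for start in range(len(signal) - len(kernel) + 1):
        field = signal[start:start + len(kernel)]  # one receptive field
        out.append(sum(a * b for a, b in zip(kernel, field)) / (k_norm * l2(field)))
    return out


hidden = cosine_dense([1.0, 2.0, 2.0], [[1.0, 2.0, 2.0], [-2.0, 1.0, 0.0]])
feature_map = cosine_conv1d([1.0, 2.0, 4.0, 8.0], [1.0, 2.0])
```

In the dense layer the input norm is shared across units, while in the convolution the kernel norm is shared across receptive fields, which matches the description above.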

Empirically, we find that when using the ReLU activation function, the result of normalization
needs no re-scaling and re-shifting. Therefore, there is no additional parameter
to be learned or hyper-parameter to be preset. However, when using other activation
functions, like Sigmoid, Tanh, or Softmax, the result of normalization should be
re-scaled and re-shifted to fully utilize the non-linear regime of the functions.

One thing to note is that cosine similarity can only measure the similarity
between two non-zero vectors, since the denominator cannot be zero. A non-zero bias can
be added to avoid a zero vector. Let $w = [w_1, w_2, \dots, w_i]$ and
$x = [x_1, x_2, \dots, x_i]$. After adding the bias, $w$ becomes $[w_0, w_1, w_2, \dots, w_i]$ and $x$
becomes $[x_0, x_1, x_2, \dots, x_i]$, where $w_0$ and $x_0$ should be non-zero.
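The effect of the bias entries can be shown with an all-zero input, which would otherwise lead to a division by zero. The sketch below is a minimal illustration; the function name and the default bias values of 1.0 are assumptions.

```python
import math


def cosine_with_bias(w, x, w0=1.0, x0=1.0):
    """Prepend non-zero bias entries w0 and x0 before taking the cosine."""
    if w0 == 0 or x0 == 0:
        raise ValueError("bias entries must be non-zero")
    w_aug = [w0] + list(w)
    x_aug = [x0] + list(x)
    dot = sum(a * b for a, b in zip(w_aug, x_aug))
    norms = math.sqrt(sum(a * a for a in w_aug)) * math.sqrt(sum(b * b for b in x_aug))
    return dot / norms  # norms > 0 because of the bias entries


# An all-zero input vector no longer causes a division by zero.
value = cosine_with_bias([0.5, -0.5], [0.0, 0.0])
```

Since both augmented vectors contain a non-zero entry, their norms are strictly positive for any input.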


As mentioned above, cosine normalization keeps the pre-activation within a very
narrow range. As a result, when using non-ReLU activation functions, e.g. Sigmoid,
Tanh, or Softmax, the result of normalization should use a larger re-scaling coefficient to
fully utilize the non-linear regime of the functions. Besides, as shown in Eqs. 6 and 7,
the magnitudes of the derivatives are much smaller since they are also divided by the
lengths of $w$ and $x$. Therefore, we need a larger learning rate to train the network with
ReLU activation when the result of normalization is not re-scaled and re-shifted.


4 Experiments 


In this section, we compare our cosine normalization and centered cosine normalization 
(PCC) with batch, weight and layer normalization in fully-connected neural networks 
on the MNIST and 20NEWS GROUP data sets. Additionally, convolutional networks 
with different normalization are evaluated on the CIFAR-10, CIFAR-100 and SVHN 
data sets. 


4.1 Fully-Connected Networks 


There are two data sets used in this section. (1) MNIST. The MNIST [12] data set
consists of 28 x 28 pixel black and white images of handwritten digits. The task is to
classify the images into 10 digit classes. There are 60,000 training images and 10,000
test images in the MNIST data set. We scale the pixel values to the [0, 1] range before
feeding them to our models. (2) 20NEWS GROUP. The original training set contains
11269 text documents, and the test set contains 7505 text documents. Each document is
classified into one topic out of 20. For convenience of using mini-batch gradient
descent, 69 examples in the training set and 5 examples in the test set are randomly dropped.
As a result, there are 11200 training examples and 7500 test examples in our experiments.
Words whose document frequency is larger than 5 are used as the input
features, which finally yields 21567 feature dimensions. Then, the Term
Frequency-Inverse Document Frequency (TF-IDF) model is used to transform the text
documents into vectors. After that, each feature is re-scaled to the range [0, 1].

A fully-connected neural network which has two hidden layers is used in experi- 
ments of MNIST and 20NEWS GROUP. Each hidden layer has 1000 units. The last 
layer is the Softmax classification layer with 10-class for MNIST, and 20-class for 
20NEWS GROUP. ReLU activation function is used in the hidden layers. All weights 
are randomly initialized by truncated normal distribution with 0 mean and 0.1 variance. 
Mini-batch gradient descent is used to train the networks. The batch size is 100. In our 
experiments, we use no re-scaling and re-shifting after normalization for hidden layers 
which use ReLU activation. However, for the last layer, we re-scale the normalized 
values before inputting to Softmax. The learning rates for cosine normalization, 
centered cosine normalization (PCC), batch normalization, weight normalization, and 
layer normalization are 10, 10, 1, 1, and 1, respectively. No regularization, 
dropout, or dynamic learning rate is used. We train the fully-connected nets for 200 
epochs, after which the performance no longer improves. 

The results of test error for MNIST are shown in Fig. 2. As we can see, the 
converging speeds for different normalization techniques are close. That observation is 
also true for other data sets we will present next. That is to say, cosine normalization 
can accelerate the training of networks as well as other normalization. We can also 
observe that centered cosine normalization (Pearson Correlation Coefficient) and cosine 
normalization achieve similar test errors, which are slightly better than that of layer 
normalization. Centered cosine normalization achieves the lowest mean test error of 
1.39%, while cosine and layer normalization achieve 1.40% and 1.43%, respectively. 
Weight normalization has the highest test error, 1.65%. Although batch normalization 
reaches the lowest test error at some points, its test error shows a large variance as 
training continues. This large fluctuation is caused by the change of statistics across 
different mini-batches. 

Fig. 2. The MNIST test errors of different normalization techniques. 

The results for 20NEWS GROUP are shown in Fig. 3. Centered cosine normal- 
ization achieves the lowest test error 29.37%, and cosine normalization achieves the 
second lowest test error 31.73%. Batch normalization performs poorly in this 
high-dimensional text classification task, achieving only 43.94% test error. Weight 
normalization (33.55%) and layer normalization (33.29%) achieve close performances. 
Both batch and weight normalization have larger variances of test error than the other 
normalization techniques. 

Fig. 3. The 20NEWS test errors of different normalization techniques. 

Fig. 4. The CIFAR-10 test errors of different normalization techniques. 

Fig. 5. The CIFAR-100 test errors of different normalization techniques. 

Fig. 6. The SVHN test errors of different normalization techniques. 


4.2 Convolutional Networks 


In this section, convolutional networks with different normalization are evaluated on 
the CIFAR-10, CIFAR-100 and SVHN data sets. (1) CIFAR-10/100. CIFAR-10 [13] is 
a data set of natural 32 x 32 RGB images in 10 classes, with 50,000 images for 
training and 10,000 for testing. CIFAR-100 is similar to CIFAR-10 but has 100 
classes. To augment the data, the images are cropped to 24 x 24 pixels, centrally for 
evaluation or randomly for training. Then, a series of random distortions is applied: 
(a) randomly flip the image from left to right, (b) randomly distort the image brightness, 
(c) randomly distort the image contrast. The augmentation procedure is the same as in 
the CIFAR-10 example in TensorFlow [14]. (2) SVHN. The Street View House Numbers 
(SVHN) [15] data set includes 604,388 training images (both training set and extra set) 
and 26,032 testing images. Similar to MNIST, the goal is to classify the digit centered 
in each 32 x 32 RGB image. We augment the data using the same procedure as for 
CIFAR-10/100 mentioned above. 
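
The cropping and flipping steps of this augmentation can be sketched in plain Python. The helper names `crop` and `augment` are hypothetical, images are assumed to be nested lists indexed as image[row][column], and the brightness and contrast distortions are left out because the text does not give their ranges.

```python
import random


def crop(image, size, top, left):
    """Return the size x size window whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, size=24, training=True, rng=random):
    """Random crop and random left-right flip for training, central crop otherwise."""
    height, width = len(image), len(image[0])
    if training:
        # Random crop position; randint includes both end points.
        top = rng.randint(0, height - size)
        left = rng.randint(0, width - size)
    else:
        # Central crop for evaluation.
        top, left = (height - size) // 2, (width - size) // 2
    out = crop(image, size, top, left)
    if training and rng.random() < 0.5:
        out = [row[::-1] for row in out]  # flip from left to right
    return out
```

Passing a seeded `random.Random` instance as `rng` makes the training-time augmentation reproducible.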


To evaluate the convolutional networks, a VGG-like architecture, 3 * Conv512 - 
Maxpool - 3 * Conv512 - Maxpool - 3 * Conv512 - Maxpool - 2 * Fully1000 - Softmax, 
is used in the experiments on CIFAR-10/100 and SVHN. Each convolutional layer has 
3 x 3 receptive fields with a stride of 1, and each max pool layer has 2 x 2 regions 
with a stride of 1. We train the convolutional nets for 10^5 steps, after which the 
performance no longer improves. The batch size is 128. Other setups are the same as 
in the experiments on fully-connected networks. 

The results for CIFAR-10 are shown in Fig. 4. Centered cosine normalization 
achieves the lowest test error 6.39%, and cosine normalization achieves the second 
lowest test error 7.33%. The layer normalization also achieves good performance, 
better than batch normalization, in this experiment. It achieves 7.42% test error. Batch 
normalization achieves test error 8.08%, and still has larger variance of test error than 
other normalization. Weight normalization achieves the highest test error 8.55%. 

The results for CIFAR-100 are shown in Fig. 5. Centered cosine normalization 
achieves the lowest test error 27.49%. Cosine normalization and batch normalization 
achieve very close performance, 31.02% and 31.01% respectively. But batch nor- 
malization has larger variance of test error. Weight normalization achieves the highest 
test error 37.87%. 

The results for SVHN are shown in Fig. 6. Centered cosine normalization achieves 
the lowest test error 2.22%, and cosine normalization achieves the second lowest test 
error 2.34%. Batch and layer normalization achieve test error 2.49%, 2.58% respec- 
tively. Weight normalization has the highest test error 2.63%. 


5 Conclusions 


In this paper, we propose a new normalization technique, called cosine normalization, 
which uses cosine similarity or centered cosine similarity (the Pearson correlation 
coefficient) instead of the dot product in neural networks. Cosine normalization bounds 
the pre-activation of a neuron within a narrower range and thus lowers the variance of 
neurons. Moreover, cosine normalization makes the model more robust to different input 
magnitudes. Networks with cosine normalization can be trained using back propagation. 
It does not depend on any statistics on batch or mini-batch examples, and performs the 
same computation in forward propagation at training and inference times. We evaluate 
cosine normalization on the fully-connected networks, convolutional networks and 
recurrent networks on various data sets. Experiments show that cosine normalization 
and centered cosine normalization (PCC) achieve better performance than other nor- 
malization techniques. 
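
The two pre-activations summarized above can be illustrated for a single neuron. This is a minimal sketch under stated assumptions: the function names and the small `eps` term that guards against division by zero are illustrative choices and are not taken from the paper.

```python
import math


def cosine_preactivation(w, x, eps=1e-12):
    """Cosine similarity between weight vector w and input x, bounded in [-1, 1]."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    return dot / (norm_w * norm_x + eps)


def centered_cosine_preactivation(w, x, eps=1e-12):
    """Pearson correlation coefficient: cosine similarity of mean-centered vectors."""
    mean_w = sum(w) / len(w)
    mean_x = sum(x) / len(x)
    return cosine_preactivation(
        [wi - mean_w for wi in w], [xi - mean_x for xi in x], eps
    )
```

Because both vectors are divided by their norms, multiplying the input by a positive constant leaves the pre-activation essentially unchanged, which is the robustness to input magnitude mentioned in the conclusions.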


Abstract. This work combines data-driven machine learning models with 
high-fidelity density functional theory to identify new potential 
thermoelectric materials. The traditional method of thermoelectric 
material discovery from an almost limitless search space of chemical 
compounds involves expensive and time-consuming experiments. In the 
current work, density functional theory (DFT) simulations are used to 
compute the descriptors (features) and thermoelectric characteristics 
(labels) of a set of compounds. The DFT simulations are computationally 
very expensive, and hence the database is not exhaustive. In 
anticipation that the important features can be learned by machine 
learning (ML) from the limited database, and that this knowledge can be 
used to predict the behavior of any new compound, the current work adds 
knowledge related to (a) the influence of the selection of training/test 
data, (b) the influence of the complexity of ML algorithms, and (c) the 
computational efficiency of the combined DFT-ML methodology. 


Keywords: Machine learning · Density functional theory · 
Thermoelectric · Material screening · Discovery 


1 Introduction 


Thermoelectric (TE) materials are receiving wide attention due to their 
potential role in mitigating global greenhouse effects, as they enable 
conversion of waste heat energy directly to electrical energy. Currently, the 
three approaches to finding better thermoelectric materials are: (a) the 
traditional experimental approach, (b) physics-based computational approaches 
like Density Functional Theory (DFT), and (c) the recent machine learning (ML) 
based data-driven approach. Amongst these, the machine learning approach has 
shown some success in finding new chemistries (that are capable of being 
thermoelectric), but it is a nascent application area with limited published 
work. All the approaches have certain limitations: (a) traditional 
experimental approaches are not an efficient way of exploring new unknown 
chemistries, and they focus mostly on modifying known material compounds by 
doping and nano-structuring to make these known thermoelectric materials 
better; (b) high-fidelity physics-based models like DFT are computationally 
prohibitive to use; and (c) for ML, obtaining bountiful data is an expensive 
process. ML models need to generalize well, and to learn patterns well enough 
from a small pool of available training data to be able to search for new 
potential materials in the vast expanse of the search-space of unknown 
materials. The current work aims to contribute to the field of machine 
learning and material screening by understanding the influence of a limited 
dataset, and whether it can be mitigated, by studying: (a) the influence of 
the training-test split in model development, (b) the influence of model 
selection, and (c) a framework combining data-driven machine learning models 
with physics-based density functional theory (DFT) to identify potential 
thermoelectric materials using a metric called ‘figure of merit’. DFT enables 
generation of training data for ML, and a trained ML model is expected to save 
time in finding potential materials in the vast material search-space. The 
main objectives of this work can be enumerated as: 


1. In the limited dataset scenario: understand the influence of training/test com- 
pound selection on ML predictions. 

2. Combine data-driven models with physics-driven models to mitigate limited 
dataset scenarios, and understand the efficiency of this approach in identifying 
potential thermoelectric materials. 

3. Compare the performance of the two ML algorithms: Random Forest (RF) 
and Deep Neural Network (DNN) for the limited dataset scenario. 


2 Methodology and Data 


This is treated as a regression problem, where the ML model learns to predict the 
figure of merit (ZT) values of a given compound at a given temperature and at 
a given chemical potential state. The performance of a material as a thermoelec- 
tric material is evaluated using this ZT. A material with a high ZT is supposed 
to be a good thermoelectric material. The ZT is a function of Seebeck coeffi- 
cient, temperature, electrical conductivity, the electronic thermal conductivity, 
and lattice thermal conductivity. Previous research on thermoelectric materials 
involving machine learning did not use ZT as a target characteristic; instead, it 
used the key properties in a stand-alone way (i.e. band gap, Seebeck coefficient, 
etc.). The three key components needed for developing the methodology are: 
(a) Data: data for model development (cross-validation/training data), for 
model testing (hidden test data) and for model application (search-space data 
to look for potential materials), (b) Descriptors (features), and (c) Choice of ML 
algorithms. These three components are discussed next. 


2.1 Descriptors 


Descriptors (known as features in the ML community) are the characteristics of 
materials (e.g., crystal structure, chemical formula, etc.) that might correlate 
with the material’s properties of interest (ZT). Here, we use 50 features 
(descriptors or independent variables) for a given data-point. The features 
involve both numerical variables and categorical variables (crystal shape). The 
50 features used are: temperature, chemical potential (eV), elements in cell, 
mean and variance of atomic mass, atomic radius, electronegativity, valence 
electrons, a set of features related to the periodic table (group numbers, row 
numbers, electronic configurations), and 6 one-hot encoded features for crystal 
shape (‘tetragonal’, ‘trigonal’, ‘orthorhombic’, ‘cubic’, ‘monoclinic’, 
‘triclinic’, ‘hexagonal’). 
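
One-hot encoding of the categorical crystal-shape descriptor can be sketched as follows. The names `CRYSTAL_SHAPES` and `one_hot_crystal_shape` are hypothetical. The text mentions 6 one-hot features but lists seven shapes; this sketch simply encodes all seven listed shapes and makes no claim about which encoding the authors used.

```python
# The seven crystal shapes listed in the text, in the order given there.
CRYSTAL_SHAPES = [
    "tetragonal", "trigonal", "orthorhombic", "cubic",
    "monoclinic", "triclinic", "hexagonal",
]


def one_hot_crystal_shape(shape):
    """Return a list with a single 1 at the position of the given shape."""
    if shape not in CRYSTAL_SHAPES:
        raise ValueError(f"unknown crystal shape: {shape!r}")
    return [1 if name == shape else 0 for name in CRYSTAL_SHAPES]
```

The resulting vector would be concatenated with the numerical descriptors to form the feature row of a data point.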


2.2 Data 


Limited Data Scenario: The dataset is deemed limited in this work because, 
based on the available training dataset of just 115 compounds (about 87,975 
instances/data points with known ZT values), the trained ML model has to learn 
to predict potential compounds (i.e. ZT values) in a vast chemical search-space 
of 4800 compounds (240,312 data-points). The compounds in the training dataset 
are different from the compounds in the chemical search-space. 


Data Generation and DFT: It is time-consuming to generate dataset using 
experiments. Here, the database is generated using high-fidelity physics-driven 
DFT followed by semi-classical Boltzmann theory. The DFT is a computational 
quantum mechanical modeling method used to investigate the electronic struc- 
ture (principally the ground state) of many-body systems, in particular atoms, 
molecules, and the condensed phases. Using this theory, the properties of a sys- 
tem can be determined by using functionals, i.e. functions of the spatially depen- 
dent electron density. Boltzmann theory helps to estimate the Boltzmann trans- 
port properties of candidate materials (like, Seebeck Coefficient, thermal con- 
ductivity, electrical conductivity) based on DFT-predicted band structures. The 
ZT for each compound is then computed using these transport properties. The 
ZT values of 115 materials (compounds) have been generated. This gives a 
database of 87,975 instances (datapoints), as each compound is studied over 
15 temperature levels and over 51 chemical potential states: 
115 x 51 x 15 = 87,975. Each instance (or data-point) has 50 features 
associated with it. Thus, the input data matrix for building the ML model is 
87,975 x 50, which is to be divided into training data (training and 
validation sets) and a test data set. 
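
As a small worked example, the sketch below evaluates the standard thermoelectric figure of merit, ZT = S^2 * sigma * T / (kappa_e + kappa_l), and reproduces the size of the data matrix. The function and constant names are hypothetical, and the paper does not spell out the formula, so the standard textbook definition is assumed here.

```python
def figure_of_merit(seebeck, sigma, temperature, kappa_e, kappa_l):
    """Dimensionless ZT from transport properties in SI units.

    seebeck: Seebeck coefficient (V/K); sigma: electrical conductivity (S/m);
    temperature: absolute temperature (K); kappa_e, kappa_l: electronic and
    lattice thermal conductivity (W/(m K)).
    """
    return seebeck ** 2 * sigma * temperature / (kappa_e + kappa_l)


# Grid described in the text: compounds x chemical potentials x temperatures.
N_COMPOUNDS, N_POTENTIALS, N_TEMPERATURES, N_FEATURES = 115, 51, 15, 50
n_points = N_COMPOUNDS * N_POTENTIALS * N_TEMPERATURES
data_matrix_shape = (n_points, N_FEATURES)
```

Each compound contributes 51 x 15 = 765 rows, which is the block size behind the 74,970-row and 13,005-row splits used later.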


Uniqueness in Splitting the Training and Test Dataset: We do not randomly 
split the 87,975 datapoints into training and test datasets. The dataset is 
split so that the ML model is trained on certain compounds and tested on 
unseen compounds. About 85% of the dataset (98 compounds, a dataset of 
74,970 x 50) is used for model building through both training and validation 
sets, and 15% of the dataset (17 compounds, a dataset of 13,005 x 50) is used 
to test the model. Since the purpose is to test the generalization ability of 
the ML model to discover new chemical species, we looked at whether the ML 
model trained on 98 compounds can help to predict the ZT values of the unseen 
17 compounds. Hence, the sensitivity to the selection of compounds for the 
training and test data needs to be checked. This is checked by creating 3 
cases of train/test split data: 


1. Case 1. Test/train split. Randomly selecting 17 compounds in test (corre- 
sponding to 13,005 datapoints) and 98 compounds in train (corresponding to 
74,970 datapoints) (with random seed 0.2). 

2. Case 2. Test/train split. Randomly selecting 17 compounds in test and 98 
compounds in train (with random seed 0.4). A different random selection 
gives different sets of compounds in train/test than case 1. 

3. Case 3. Deterministically selecting test and train compounds. Out of the 115 
compound database, a chunk of 17 compounds lying in the middle has been 
selected as test data. These 17 compounds in the middle do not possess 
extreme characteristics (i.e. they are neither too simple nor too complex, 
as represented in the values of the features associated with the compound), 
while the training data encompasses all types of compounds. Here, by complex 
compounds, we refer to compounds with more than 3 elements. 
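
The three cases above all split by compound rather than by row. A minimal sketch of such a group-wise split is given below; the function name `split_by_compound`, the integer compound labels, and the integer seed are assumptions made for illustration only.

```python
import random


def split_by_compound(compound_ids, n_test, seed=0, middle=False):
    """Return (train_rows, test_rows) so that no compound appears in both."""
    # Unique compounds in order of first appearance in the database.
    compounds = list(dict.fromkeys(compound_ids))
    if middle:
        # Deterministic choice: a contiguous chunk from the middle (case 3).
        start = (len(compounds) - n_test) // 2
        test_compounds = set(compounds[start:start + n_test])
    else:
        # Random choice of test compounds (cases 1 and 2).
        test_compounds = set(random.Random(seed).sample(compounds, n_test))
    train_rows = [i for i, c in enumerate(compound_ids) if c not in test_compounds]
    test_rows = [i for i, c in enumerate(compound_ids) if c in test_compounds]
    return train_rows, test_rows


# 115 hypothetical compound labels, each with 51 x 15 = 765 data points.
compound_ids = [c for c in range(115) for _ in range(51 * 15)]
train_rows, test_rows = split_by_compound(compound_ids, n_test=17, seed=0)
```

Splitting on compound labels keeps every data point of a test compound out of training, which is what makes the test score a measure of generalization to unseen chemistry.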


Search-Space Data: For screening and discovering potential thermoelectric 
materials, the trained machine learning model has been applied to a database of 
silicides (silicon-based compounds). This database is extracted from the material 
science project, and is called the chemical search-space data set in this work. 
The search-space data-matrix size is 240,312 data instances x 50 features. 


2.3 Choice of Algorithms 


Here, two different algorithms have been tested: Random Forest [1] and a more 
complex Deep Neural Network [2]. This work is intended to understand whether 
with the limited dataset, a complex model can perform well or not. 


2.4 Model Selection - Cross Validation and Learning Curve 


The two machine learning models have been compared using the cross-validation 
(CV) method. CV is a model validation technique for assessing the generalization 
ability of a machine learning algorithm to an independent data set. In our work, 
we split the original dataset into the ‘training’ and the ‘test’ dataset. Here, we 
have selected a 3-fold CV procedure, where the ‘training set’ is split further into 
3 smaller sets. The model is fitted using 2 of these 3 folds at a time, and the 3rd 
fold that is left out is used for validation (called the validation set). The 
average R2 (coefficient of determination) score from the 3-fold CV is used as the 
performance measure. The best possible R2 score is 1.0, indicating a model with 
high accuracy, and the score can be negative if the model performs badly. 
The learning curve helps to obtain the best parameter sets for the two models 
using the above CV process. In Fig. 1, we use the CV procedure to obtain a 
learning curve. The curve shows the variation of the average R2 score with 
training data and validation data (for RF) and the variation of the average R2 
score with increasing epochs (iterations) for DNN. These curves help in 
understanding the bias-variance tradeoff. The learning curve (in Fig. 1) is shown 
for only one case (case 3), and for only the best parameter sets of case 3 (for 
brevity). For case 3, the best parameter sets are: RF: the maximum number of trees 
is 30 and the maximum depth of a tree is 20. DNN: the network comprises an input 
layer (with 50 neurons representing the 50 input features), an output layer and 
six hidden layers (with the following number of units in each successive layer: 
43; 20; 20; 15; 10; 5). A combination of ReLU and Tanh activation functions is 
used in this work. 
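
The 3-fold procedure and the R2 score can be sketched in a few lines of plain Python. The names `r2_score`, `k_fold_indices` and `cross_val_r2` are hypothetical, as is the strided assignment of rows to folds; the text does not say how the folds were formed.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def k_fold_indices(n, k=3):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]  # assumed strided folds
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def cross_val_r2(fit, predict, X, y, k=3):
    """Average validation R2 over k folds for a user-supplied model."""
    scores = []
    for train, val in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in val])
        scores.append(r2_score([y[i] for i in val], preds))
    return sum(scores) / k
```

A prediction equal to the mean of the validation targets scores 0, and anything worse than that scores below 0, which is how negative R2 values such as those reported later can arise.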

The learning curve (in Fig. 1) suggests some over-fitting for both models, 
which is more dominant for the DNN than for the RF model. This could be 
attributed to the larger amount of data needed by DNN models. The R2 scores on 
training data for both RF and DNN are in the range of 0.95-1, while for the 
validation data (called test in the DNN figure here), the R2 score falls 
drastically to 0.45 for DNN and only slightly to 0.985 for RF. The overfitting 
(variance error) is seen in the other cases too (case 1 and case 2; these 
learning curves are not shown here for brevity). The influence of the 3 
different train-test splits on the performance of the two ML models is 
considered next. It needs to be seen whether proper selection of the training 
compound-test compound split can mitigate the overfitting and improve the 
generalization ability of the ML models. 

Fig. 1. Judging bias (underfitting) vs variance (overfitting) errors for RF and complex 
DNN models for the two cases 


3 Results and Discussion 


Material screening is challenging in the sense that using the available limited 
database of known chemistry, the trained ML model should have learned the 
ability to find new potential material characteristics in new unseen chemistry 
in the vast material search-space. It is important to understand whether the 
way to split the limited material database into training dataset (training and 
validation dataset) and testing dataset (of unseen compounds) will influence the 
performance of the two machine learning models (simple RF or complex DNN). 


3.1 Sensitivity Study: Influence of Training and Testing Dataset 
Selection 


Figure 2 shows the influence of splitting the training/test data on the 
performance of the models for the three cases. For each case, Fig. 2 shows the 
ZT values predicted by the two models (RF and DNN) vis-a-vis the actual ZT 
values for the compounds in the training and test data. Results for the 3 cases 
show: 


Case 1 and Case 2 (Comparing R2 Scores on Train and Test Data by the Two 
Models): Both cases have randomly generated but different sets of 98 compounds 
for training and 17 compounds for test. 


DNN Performance: The test R2 score for case 1 drops to 0.2, while the 
corresponding case 1 train R2 score is 0.97. Similarly, the case 2 test R2 score 
drops to -0.14, while the corresponding case 2 train R2 score is 0.97. The large 
drop in R2 scores for test data indicates poorer generalization ability for DNN. 


RF Performance: In the case of RF too, R2 scores drop for the two test datasets, 
but its performance is much better than that of the DNN. For RF, the case 1 test 
R2 score is 0.82, while the corresponding case 1 train R2 score is 0.99. 
Similarly, the case 2 test R2 score drops to 0.23, while the corresponding case 2 
train R2 score is 0.99. 

Thus, for both RF and DNN, as the train/test split varies, the generalization 
ability is influenced (despite selecting the best parameter set of the 
respective model for that database during CV). The reason for the lower R2 
scores on the case 2 test dataset (for both models) as compared to their case 1 
test scores is that the 98 randomly selected compounds in the case 2 training 
dataset with their features (a dataset of 74,970 x 50) do not provide similar 
pattern characteristics (i.e. variation of ZT with features) as the 17 
compounds in the case 2 test dataset (a dataset of 13,005 x 50). 


Case 3 (Comparing R2 Scores on Train and Test Data): Case 3 involves 98 
training compounds that encompass both simple and extreme compounds, and 
hence the models trained on them are able to capture the pattern needed to 
determine the ZT values of data-points pertaining to the 17 unseen test 
compounds. That is why we see improved predictions by the DNN and RF models on 
the case 3 test dataset: DNN shows a case 3 test R2 score of 0.45, while the 
corresponding case 3 train R2 score is 0.96. RF shows a case 3 test R2 score of 
0.76, while the corresponding case 3 train R2 score is 0.99. 

Fig. 2. Predicted vs actual ZT (with R2-score) for DNN and RF on training and 

Next, we check whether the improvements in generalization ability (better 
test R2 scores) brought about by a balanced training-test split lead to better 
predictions of materials by both models. 


3.2 Comparison of RF vs. DNN Models: Material Screening and 
Efficiency 


Searching for Potential Thermoelectric in New Search-Space: Figure 3 
shows the best two thermoelectric materials identified in a new chemical search- 
space of silicide materials of 4800 compounds for the 3 cases. For brevity, only 
top two are shown in Fig. 3 but the results explained are beyond the best two 
predicted. This chemical search-space has not been exposed to the ML models 
during their training/validation/testing phase. In all the figures, the predicted 
figure of merit (ZT) is plotted against one of the most influential features (chem- 
ical potential - eV). These six compounds below have the highest predicted ZT 
values as obtained by DNN and RF. 

The RF mostly predicts comparatively simpler compounds than the DNN, with the 
maximum value of ZT in the range of 3-3.6. RF has predicted only simple 
compounds (such as Li2MgSi, SrMgSi, BeSiIr2, SiP2O7, VSiPt) as potential 
thermoelectric silicides in its top two predictions. In contrast, DNN predicts 
complex compounds (with more than 3 elements) in about 66% of its top two 
predictions (with compounds such as Sr2Al3Si3HO13 in case 1, LiCoSiO4 in 
case 2, and Na3CaAl3Si3SO16 and Na3VSiBO7 in case 3), with a higher maximum 
value of ZT in the range 4-5. Both DNN and RF have identified a common 
thermoelectric silicide (BeSiIr2) as a potential candidate but predict 
different maximum ZT values (RF predicts a ZT of 3.5, while DNN predicts a ZT 
of around 4.5). 

DNN is learning more complex patterns than RF and predicting higher ZT values 
due to overfitting (higher variance error), as observed in the previous fits in 
Fig. 2. Further, DNN predicts erroneous profiles of ZT as a function of chemical 
potential (Fig. 3(c) left, and (e) both) that are not physically realistic. Thus, 
the split in training data is not benefitting DNN. The solution for overfitting 
in DNN is either to build artificial neural network (ANN) models with simpler 
architecture or to generate a larger training dataset. 

Since the intention of this paper was to gain knowledge about the possible 
behavior of DNN in current material screening applications (most of which have 
limited datasets), simpler ANN models are not shown in this work. Despite being 
the most popular model today, DNN does not work well when the dataset is limited. 


Validation of Selecting Training/Test Dataset and Model Selection: 
In the literature, materials of the form Mg2LiSi are currently under 
investigation [3]. Li2MgSi is the closest form that has been predicted by RF in 
the balanced Case 3 training/test dataset. This work shows the importance of 
balancing the training/test dataset when the dataset is limited and when the 
trained model has to have good generalization ability so as to find materials in 
a new chemical space. Most of the complex compounds predicted by DNN are not 
possible to test experimentally in the lab, but the overfitting seen in DNN 
performance suggests that it is better not to pursue those complex models (as 
the results may not be reliable). 

Fig. 3. DNN vs RF (shaded) predicted best two thermoelectric materials for the three 
cases. eV refers to chemical potential on the horizontal axis. DNN suggests more 
complex compounds as compared to the Random Forest. 


Computational Efficiency: For DFT alone, the CPU consumption is between 
25 and 1500 h to evaluate the ZT value of a composition (compound), and the 
average CPU time per compound is 85 h. It would take around 408,000 CPU hours 
to discover the material with the best ZT amongst the 4800 compounds of the 
chemical search-space. For the ML step alone, the computation cost for 
obtaining ZT values of about 4800 compounds, after training on the dataset of 
115 compounds, is 132 s for DNN and 80 s for RF. The cost of preparing the 
training database for these 115 compounds from DFT is around 85 h per compound 
x 115 compounds = 9775 h. Thus, we can neglect the 132 s of DNN and 80 s of RF 
with respect to the 9775 h required to generate the training database, and the 
total cost of evaluating ZT using the ML approach for 4800 compounds is only 
about 2% of the time needed by the DFT-alone method. 
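
The cost comparison can be checked with a few lines of arithmetic. All numbers below come from the paragraph above; the variable names are illustrative.

```python
HOURS_PER_COMPOUND = 85        # average DFT CPU time per compound (h)
N_SEARCH_COMPOUNDS = 4800      # compounds in the chemical search-space
N_TRAINING_COMPOUNDS = 115     # compounds in the DFT-generated database
ML_SECONDS = {"DNN": 132, "RF": 80}

# DFT alone: every search-space compound is simulated.
dft_only_hours = HOURS_PER_COMPOUND * N_SEARCH_COMPOUNDS

# DFT + ML: only the training database is simulated, then ML predicts the rest.
training_hours = HOURS_PER_COMPOUND * N_TRAINING_COMPOUNDS
combined_hours = {
    name: training_hours + seconds / 3600.0 for name, seconds in ML_SECONDS.items()
}
cost_ratio = {name: hours / dft_only_hours for name, hours in combined_hours.items()}
```

For both models the ratio comes out at roughly 0.024, consistent with the figure of about 2% quoted above, and the ML prediction time contributes only a few hundredths of an hour to the total.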


4 Conclusions 


1. In the limited dataset scenario: RF has lower variance error than DNN and is 
seen to predict simpler potential compounds from the search-space data 
than the DNN model. DNN predicts complex compounds from the search-space 
data (that are difficult to make and verify in the lab). Further, DNN 
sometimes shows physically unrealistic ZT profile predictions due to 
overfitting, and the remedy indicated here is more training data. 

2. A significant influence of the training-test split on the model is seen 
despite using the CV procedure to select the best model parameters for 
generalization. Hence, when the dataset is limited, this aspect should be 
checked. Amongst the three cases (two random and one deterministic 
train-test split), the variance error was lowered for the case where the 
training data could encompass compounds with extreme features. The RF model 
also provided the ‘verifiable’ predicted potential thermoelectric material 
in the search-space (Li2MgSi) from this balanced deterministic train-test 
dataset, but this strategy did not benefit DNN. 

3. The combined DFT and machine learning approach with RF is computationally 
more efficient than an approach involving DFT alone. 


Abstract. Recently, deep neural networks (DNNs) have been widely 
applied in mobile intelligent applications. The inference for the DNNs is 
usually performed in the cloud. However, this leads to a large overhead of 
transmitting data via the wireless network. In this paper, we demonstrate 
the advantages of cloud-edge collaborative inference with quantization. 
By analyzing the characteristics of layers in DNNs, an auto-tuning 
neural network quantization framework for collaborative inference is 
proposed. We study the effectiveness of mixed-precision collaborative 
inference of state-of-the-art DNNs using the ImageNet dataset. The 
experimental results show that our framework can generate reasonable 
network partitions and reduce the storage on mobile devices with trivial 
loss of accuracy. 


Keywords: Neural network quantization · Auto-tuning framework · 
Edge computing · Collaborative inference 


1 Introduction 


In recent years, deep neural networks (DNNs) [14] have been widely used and 
show impressive performance in various fields including computer vision [12], 
speech recognition [9], natural language processing [15], etc. As neural 
network architectures become more complex and deeper, from LeNet [13] (5 
layers) to ResNet [8] (152 layers), the storage and computation of the models 
increase. In other words, network training and inference require more 
resources. The large size of DNN models limits the applicability of network 
inference on mobile edge devices. Therefore, most artificial intelligence (AI) 
applications on mobile devices send the input data of the DNN to cloud servers, 
and the network inference is executed in the cloud only. However, cloud-only 
inference has some notable weaknesses: (1) transmission overhead: it leads to 
a large overhead of uploading data, especially when the mobile edge devices 
are in low-bandwidth wireless environments. (2) privacy disclosure: sometimes 
personal data, e.g. one’s photos and videos, are not allowed to be sent to the 
cloud servers directly. 

Today’s mobile devices, such as Apple’s iPhone and NVIDIA’s Jetson TX2, 
have greater computing power and larger memory. In addition, many neural 
network quantization methods [3,4,7,18,19] have been proposed for reducing the 
resource consumption of DNNs. By using quantization, the data of a network 
can be represented by low-precision values, e.g. INT8 (8-bit integer). On the one 
hand, low-precision data reduces storage of DNNs and enables network models to 
be stored on the mobile edge device with limited resources. On the other hand, 
with the use of high-performance libraries for low-precision computing [1,2], 
the speed of the network inference will be improved. This makes it possible to 
perform some or all parts of neural network inference on mobile devices and 
leads to a new inference mode: cloud-edge collaborative inference. 
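
To make the INT8 representation concrete, the sketch below applies symmetric linear quantization to a list of floating-point values. This is one common generic scheme and is only an assumption here; the cited quantization methods and the framework proposed in this paper may work differently. The function names are illustrative.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with a single shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the INT8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]
```

Each value then needs 8 bits instead of 32, at the price of a rounding error of at most about half a quantization step per value.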


[Figure: framework stages labeled Deployment, Partition and Inference; the inference 
stage shows inference of the quantized part on the edge device and inference in the cloud.] 

Fig. 1. Overview of auto-tuning framework 


In this paper, we propose an auto-tuning neural network quantization frame- 
work as shown in Fig. 1. During deployment, the framework profiles the operators 
of DNNs on edge devices and generates the candidate layers as partition points. 
When the neural network is ready to be used, the framework starts auto-tuning
for network partition. At inference time, the first part of the network is
quantized and executed on the edge device, and the second part of the network
is executed on the cloud servers. On the edge, we use the quantized neural
network to reduce storage and computation. In the cloud, we use the original
full-precision network to achieve high accuracy.

In collaborative inference, quantized neural networks reduce the storage of
models. Intermediate results of quantized networks are also low-precision
data, which reduces data communication between cloud and edge, so the user’s
mobile device transmits less data when using AI applications. Additionally,
transmitting intermediate results, rather than the original input data, can
protect personal information. In realistic scenarios, analyzing and testing
partitions by hand is tedious and time-consuming, and it is inconvenient for
a developer to test and decide how to partition the network. Our automatic
tuning framework helps developers find the most reasonable partition of a DNN.
The contributions of this paper are summarized as follows:


• Analysis of DNN partition points: We analyze the structures of deep neural
networks and show which layers are reasonable partition points. Based on the
analysis, we can generate candidate layers as partition points of a specific
neural network (Sect. 2.2).

• Auto-tuning quantization framework for collaborative inference: We develop
an auto-tuning neural network quantization framework for collaborative
inference between cloud and edge. The framework quantizes neural networks
according to the candidate partition points and provides an optimal
mixed-precision partition for cloud-edge inference by auto-tuning (Sect. 2.3).

• Experimental study: We show the performance of collaborative inference of
state-of-the-art DNNs on the ImageNet dataset. The framework generates
reasonable network partitions and reduces the storage of inference on mobile
devices with trivial loss of accuracy (Sect. 3).


2 Auto-tuning Quantization Framework 


In this section, we present our auto-tuning neural network quantization frame- 
work. Firstly, we briefly introduce neural network quantization. Secondly, we 
analyze the structures of the state-of-the-art DNNs. Finally, we describe the 
auto-tuning partition algorithm. 


2.1 Neural Network Quantization 


In order to accelerate inference and compress the size of DNN models, many
network quantization methods have been proposed. Some studies focus on scalar and
vector quantization [4,7], while others center on fixed-point quantization [18,19]. 
In this paper, we are mainly interested in scalar quantization of INT8, which is 
supported by many advanced computing libraries such as Google’s gemmlowp 
[1] and NVIDIA’s cuDNN [2]. In general, an operator computation of scalar 
quantized neural networks can be summarized as follows: 


• Offline Quantization
Step 1. Find quantization thresholds ($T_{min}$ and $T_{max}$) for calculating
scale factors of Input, Weights and Output;
Step 2. Quantize Input and Weights according to the following formula:

$$
Data_Q(x) =
\begin{cases}
\dfrac{(Data(x) - T_{min}) \times Range_{LP}}{T_{max} - T_{min}}, & x \in (T_{min}, T_{max}) \\
\|V_{low\text{-}precision}\|_{\infty}, & x \ge T_{max} \\
V_{min}, & x \le T_{min}
\end{cases}
\tag{1}
$$

where: $Range_{LP}$ is the range of low-precision values (e.g. 255 for INT8),
$V_{low\text{-}precision}$ is the set of low-precision values, $Data(x)$ is the
original value, $Data_Q(x)$ is the quantized value.

• On-device Computation
Step 1. $Output_Q = Operator(Input_Q, Weights_Q)$;
Step 2. Dequantize $Output_Q$ according to the following formula:

$$
Output = \frac{T_{max} - T_{min}}{Range_{LP}} \times Output_Q(x) + T_{min} \tag{2}
$$

Step 3. Output = ActivationFunction(Output);
Step 4. Quantize Output as $Input_{next}$ according to Formula 1.
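
The quantize and dequantize steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the low-precision range is assumed to be the unsigned interval [0, 255], and rounding to the nearest integer is an assumption that the text does not state.

```python
def quantize(values, t_min, t_max, range_lp=255):
    """Map real values to integers in [0, range_lp], in the spirit of Formula 1."""
    scale = range_lp / (t_max - t_min)
    quantized = []
    for x in values:
        if x >= t_max:
            quantized.append(range_lp)  # clamp to the largest low-precision value
        elif x <= t_min:
            quantized.append(0)         # clamp to the smallest low-precision value
        else:
            # Rounding to the nearest integer is an assumption of this sketch.
            quantized.append(round((x - t_min) * scale))
    return quantized


def dequantize(quantized, t_min, t_max, range_lp=255):
    """Map integers back to approximate real values, in the spirit of Formula 2."""
    step = (t_max - t_min) / range_lp
    return [q * step + t_min for q in quantized]


# Illustrative values only; thresholds are chosen by hand here.
data = [-1.5, -1.0, 0.2, 0.5, 1.0, 2.0]
q = quantize(data, t_min=-1.0, t_max=1.0)
restored = dequantize(q, t_min=-1.0, t_max=1.0)
```

Values outside the thresholds are clamped, so the round trip only recovers them up to the nearest threshold; values inside the thresholds are recovered to within one quantization step.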


2.2 Candidate Network Partition Points 


In general, a deep neural network contains many kinds of layers such as con- 
volution layers, fully-connected layers and activation layers. We analyze the 
characteristics of different network layers and decide how to select candidate 
layers as reasonable partition points. The set of candidate layers, Rule =
{L_1, L_2, ..., L_n}, is based on the results of the following analysis.


Fig. 2. Partition points of DNNs


Table 1. Analysis of inception

Partition points                   | Brother branch exists? | Inference mode of the brother branch | Data transmission
1, 13                              | No                     | /                                    | INT8 × 1
2, 3, 4, 5; 7, 8, 9; 6, 10, 11, 12 | Yes                    | Mobile edge                          | INT8 × 4
2, 3, 4, 5; 7, 8, 9; 6, 10, 11, 12 | Yes                    | Cloud                                | INT8 × 1 + FP32 × 1


Layers in Inception Networks. Inception is a structure that contains
branches; these branches are executed in parallel and their results are merged
in a network layer (e.g. a concat layer). Figure 2(a) is an example of inception
from GoogLeNet [17]. As shown, the inception contains 13 possible partition
points. Trying all of them would take a lot of time. We divide these partition
points into two groups according to whether they have at least one brother
branch (a branch that separates from the same layer and merges in the same
layer). The results of the analysis are shown in Table 1. When a partition
point has no brother branch (e.g. 1 and 13), the output of the sub-network on
edge devices contains only 1 × INT8 Blob (a 4D array for storing data). When a
partition point has a brother branch, there are two cases: (1) its brother
branch runs on the edge devices, and the sub-network output contains 4 × INT8
Blobs; (2) its brother branch runs in the cloud, and the sub-network output
contains 1 × INT8 Blob and 1 × FP32 Blob. Less data is transmitted in the first
group than in the second group. Therefore, if a network layer in inception has
a brother branch, the framework will not choose it as a candidate layer.


Table 2. Analysis of residual network

Partition points | Shortcut connection exists? | Data transmission
1, 5             | No                          | INT8 × 1
2, 3, 4, 6       | Yes                         | INT8 × 1 + FP32 × 1

Layers in Residual Networks. There are many shortcut connections in the 
residual network [8]. Figure 2(b) shows an example of a residual block which 
contains a shortcut connection. There are 6 possible partition points in this 
example. We divide these partition points into two groups according to whether
a shortcut connection exists at the partition point. When a partition point
has no shortcut connection (e.g. 1 and 5), the output of the sub-network on edge
devices contains only 1 × INT8 Blob. Otherwise, the output of the sub-network
contains 1 × INT8 Blob and 1 × FP32 Blob. Table 2 shows the analysis result.
Therefore, the network layers with shortcut connections are not reasonable
candidate layers.


Non-parametric Layers. Non-parametric layers, such as ReLU and pooling, 
have no parameters, so they require almost no memory storage. In addition, the 
computation of the non-parametric layers accounts for a very small proportion 
of the total network computation. Therefore, our framework merges the non- 
parametric layers into the nearest previous parametric layers, i.e. these non- 
parametric layers will not be used as candidate layers. 
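
The three rules of this subsection can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Layer` record, its boolean flags and the layer names are hypothetical, and a real framework would derive such flags from the network graph rather than receive them by hand.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    has_params: bool                  # False for ReLU, pooling, ...
    has_brother_branch: bool = False  # lies on one of several parallel inception branches
    has_shortcut: bool = False        # bypassed by a residual shortcut connection


def candidate_layers(layers):
    """Return the names of layers that pass all three candidate rules."""
    candidates = []
    for layer in layers:
        if not layer.has_params:
            continue  # non-parametric layers are merged into the previous layer
        if layer.has_brother_branch or layer.has_shortcut:
            continue  # splitting here would also require sending an FP32 Blob
        candidates.append(layer.name)
    return candidates


# Hypothetical toy network, for illustration only.
layers = [
    Layer("conv1", has_params=True),
    Layer("relu1", has_params=False),
    Layer("res2a_branch2a", has_params=True, has_shortcut=True),
    Layer("inception_3a_1x1", has_params=True, has_brother_branch=True),
    Layer("fc", has_params=True),
]
selected = candidate_layers(layers)
```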


2.3 Auto-Tuning Partition 


According to the candidate rule Rule, the framework performs auto-tuning
partition for cloud-edge collaborative inference, as described in Algorithm 1.
The input of the algorithm contains the candidate layer rules and a neural
network. Firstly, the candidate rules are used to select candidate partition
points in the neural network (lines 1-2). Secondly, all candidate partition
networks are tested, and their performance information is recorded in P
(lines 3-9). The function PredictPerformance predicts the performance of
collaborative inference based on the results of off-line profiling. Finally,
we find the best partition point in P for collaborative inference of the
mixed-precision neural network (lines 10-14).


Algorithm 1. Auto-Tuning Partition
Input: candidate rules Rule, neural network Net = {L_1, L_2, ..., L_n}
Output: optimized partition p_best
P = ∅; p_best ← null;
Candidate ← {L_i | L_i ∈ Rule};
for L_i in Candidate do
    Net_edge ← Net.Split(First, L_i);
    Net_cloud ← Net.Split(L_i + 1, Last);
    Engine_edge ← Net_edge(DataType<INT8>);
    Engine_cloud ← Net_cloud(DataType<FP32>);
    (L_i, info) ← PredictPerformance(Engine_edge, Engine_cloud);
    P ← P ∪ (L_i, info);
Env = GetEnvironment(Device_edge);
for p_i in P do
    if Env(p_i) is better than Env(p_best) then
        p_best ← p_i;
return p_best;
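
The tuning loop of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration, not the authors' code: the paper leaves the "is better than" criterion to the environment, and the sketch assumes it means the lowest predicted end-to-end latency (edge time plus upload time plus cloud time). The function name, the `profile` table and all numbers are hypothetical.

```python
def auto_tune_partition(layers, candidates, profile, upload_kb_per_s):
    """Pick the candidate split with the lowest predicted latency.

    profile[name] holds hypothetical offline measurements for layer `name`:
    (edge_time_s, cloud_time_s, output_size_kb).
    """
    best_point, best_latency = None, float("inf")
    for index, name in enumerate(layers):
        if name not in candidates:
            continue
        # Layers up to and including the split run quantized on the edge.
        edge_time = sum(profile[l][0] for l in layers[: index + 1])
        # The remaining layers run in full precision in the cloud.
        cloud_time = sum(profile[l][1] for l in layers[index + 1:])
        # The output of the split layer is uploaded over the wireless link.
        upload_time = profile[name][2] / upload_kb_per_s
        latency = edge_time + upload_time + cloud_time
        if latency < best_latency:
            best_point, best_latency = name, latency
    return best_point, best_latency


# Made-up profiling data, for illustration only.
layers = ["conv1", "conv2", "fc"]
profile = {
    "conv1": (0.10, 0.01, 500.0),
    "conv2": (0.20, 0.02, 50.0),
    "fc": (0.40, 0.03, 1.0),
}
best, latency = auto_tune_partition(layers, {"conv1", "conv2"}, profile, 250.0)
```

With these made-up numbers the later split wins because its smaller output outweighs the extra edge computation, which mirrors the trade-off the framework is meant to explore.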


3 Experiments


In this section, we use the ImageNet [6] dataset to test the collaborative
inference of DNNs [8,12,16,17] and show the results of our auto-tuning
framework. We illustrate the most reasonable partition for each neural network.
Edge inference runs on a mobile platform, the NVIDIA Jetson TX2 (NVIDIA’s
latest mobile SoC), with 4 × ARM Cortex-A57 CPUs, 2 × Denver CPUs and 8 GB of
RAM. Cloud inference runs on a server with an Intel Core i7 CPU, an NVIDIA
TITAN Xp GPU and 16 GB of RAM. We use Caffe [10] with cuDNN (version 7.0.5) on
the GPU of the cloud servers and gemmlowp’s [1] implementation on the CPU
of the edge devices.


3.1 Experimental Results 


Table 3 summarizes the results of our framework. We tested AlexNet, VGG16,
ResNet-18 and GoogLeNet in different wireless network environments. For each 
neural network, the framework gives the best partition point and the fastest 
partition point. According to the inference time and the speed-up in the table, 
we can see that sometimes the speed of collaborative inference is faster than 
that of the cloud inference only. This is due to the large transmission overhead 
in the low-bandwidth wireless environments. In collaborative inference, we only 
need to download the parameters required by the edge inference, which can 
significantly reduce the size of download data. If users need to achieve the fastest 
inference speed, the fastest partition point should be selected. If users need to 
avoid privacy disclosure, the best partition point should be selected. In addition, 
quantized neural networks do not lead to a significant drop in accuracy (usually 
less than 1%). 


Table 3. Experimental results of our framework

Neural network          | AlexNet | VGG16   | ResNet-18 | GoogLeNet
Wireless upload (KB/s)  | 250     | 240     | 70        | 180
Best partition point    | conv5   | conv1_2 | res4a     | conv2
Inference time (s)      | 0.36    | 5.65    | 1.86      | 1.16
Speed-up                | 1.7×    | <1×     | 1.13×     | <1×
Model download (KB)     | 2278    | 38      | 1569      | 121
Model storage reduction | 96.17%  | 99.97%  | 85.63%    | 98.22%
TOP-1 accuracy          | -0.09%  | 0.00%   | -0.19%    | -0.10%


Figure 3 shows the collaborative inference time of each candidate layer in
the wireless network environments. We take AlexNet as an example. Each bar
represents a network partition, which consists of three parts: edge inference,
data upload and cloud inference. After auto-tuning, the conv5 layer is
selected as both the best partition point (marked with a hollow pentagram) and
the fastest partition point (marked with a filled pentagram). On the edge
device, we feed input data to the neural network and perform inference of the
layers from conv1 to conv5. The output data of conv5 (pool and relu are merged)
is uploaded to the cloud, and then the inference of the layers from fc6 to fc8
is executed in the cloud. Collaborative inference achieves a 1.7× speed-up. The
accuracy drop of the network is trivial: the largest accuracy change over all
partitions is only -0.11%.


4 Related Work


Recently, many neural network quantization methods have been proposed. Gong 
et al. [7] and Cheng et al. [4] explored scalar and vector quantization methods 
for compressing DNNs. Zhou et al. [18], Zhou et al. [19] proposed fixed-point 
quantization methods. Cuervo et al. [5] and Kang et al. [11] designed frameworks 
that support collaborative computing of mobile applications. Their frameworks 
perform off-line partition for full-precision neural networks, and ours performs 
on-line partition for mixed-precision neural networks. Overall, the application
of quantization methods in cloud-edge collaborative inference has not been
studied yet. To the best of our knowledge, this is the first attempt to build a
framework for cloud-edge collaborative inference of mixed-precision neural
networks.


5 Conclusion 


In this paper, we propose an auto-tuning neural network quantization framework 
for collaborative inference. We analyze the characteristics of network layers and 
provide candidate rules to choose reasonable partition points. The auto-tuning 
framework helps developers get the most suitable partition of a neural network. 
The cloud-edge mode (i.e. collaborative inference) reduces the storage of infer- 
ence on mobile devices with trivial loss of accuracy and could protect personal 
information. 


Abstract. Deep learning on graphs has become a popular research topic
with many applications. However, past work has concentrated on learning 
graph embedding tasks, which is in contrast with advances in generative 
models for images and text. Is it possible to transfer this progress to 
the domain of graphs? We propose to sidestep hurdles associated with 
linearization of such discrete structures by having a decoder output a 
probabilistic fully-connected graph of a predefined maximum size directly 
at once. Our method is formulated as a variational autoencoder. We 
evaluate on the challenging task of molecule generation. 


1 Introduction 


Deep learning on graphs has very recently become a popular research topic [3]. 
Past work has so far concentrated on learning graph embedding tasks, i.e.
encoding an input graph into a vector representation. This is in stark contrast
with fast-paced advances in generative models for images and text, which have
seen a massive rise in the quality of generated samples. Hence, it is an
intriguing question how one can transfer this progress to the domain of graphs,
i.e. their decoding from a vector representation. Moreover, the desire for such
a method has been mentioned in the past [5].

However, learning to generate graphs is a difficult problem, as graphs are 
discrete non-linear structures. In this work, we propose a variational autoencoder 
[9] for probabilistic graphs of a predefined maximum size. In a probabilistic 
graph, the existence of nodes and edges, as well as their attributes, are modeled 
as independent random variables. 

We demonstrate our method, coined GraphVAE, in cheminformatics on the 
task of molecule generation. Molecular datasets are a challenging but conve- 
nient testbed for generative models, as they easily allow for both qualitative and 
quantitative tests of decoded samples. While our method is applicable for gen- 
erating smaller graphs only and its performance leaves space for improvement, 
we believe our work is an important initial step towards powerful and efficient 
graph decoders. 


© Springer Nature Switzerland AG 2018 
Fig. 1. Illustration of the proposed variational graph autoencoder. Starting from a
discrete attributed graph G = (A, E, F) on n nodes (e.g. a representation of propylene
oxide with 3 carbons and 1 oxygen), stochastic graph encoder $q_\phi(z|G)$ embeds the
graph into continuous representation z. Given a point in the latent space, our novel
graph decoder $p_\theta(G|z)$ outputs a probabilistic fully-connected graph
$\tilde{G} = (\tilde{A}, \tilde{E}, \tilde{F})$ on predefined $k \ge n$ nodes, from
which discrete samples may be drawn. The process can be conditioned on label y for
controlled sampling at test time. Reconstruction ability of the autoencoder is
facilitated by approximate graph matching for aligning G with $\tilde{G}$.


2 Related Work 


Graph Decoders in Deep Learning. Graph generation has been largely unexplored 
in deep learning. The closest work to ours is by Johnson [8], who incrementally 
constructs a probabilistic (multi)graph as a world representation according to a 
sequence of input sentences to answer a query. While our model also outputs a 
probabilistic graph, we do not assume having a prescribed order of construction 
transformations available and we formulate the learning problem as an autoen- 
coder. 

Xu et al. [23] learn to produce a scene graph from an input image. They
construct a graph from a set of object proposals, provide initial embeddings to 
each node and edge, and use message passing to obtain a consistent prediction. 
In contrast, our method is a generative model which produces a probabilistic 
graph from a single opaque vector, without specifying the number of nodes or 
the structure explicitly. 


Discrete Data Decoders. Text is the most common discrete representation.
Generative models there are usually trained in a maximum likelihood fashion by
teacher forcing [22], which avoids the need to backpropagate through output
discretization but may lead to exposure bias [1]. Recently, efforts have been
made to overcome this problem by using the Gumbel distribution [10] or
reinforcement learning [24]. Our work also circumvents the non-differentiability
problem, namely by formulating the loss on a probabilistic graph.


Molecule Decoders. Generative models may become promising for de novo design
of molecules fulfilling certain criteria by being able to search for them over a
continuous embedding space [14]. While molecules have an intuitive and richer
representation as graphs, the field has had to resort to textual representations
with fixed syntax, e.g. so-called SMILES strings, to exploit recent progress made
in text generation with RNNs [5,14,16]. As their syntax is brittle, many invalid
strings tend to be generated, which has been recently addressed by [11] by
incorporating grammar rules into decoding. While encouraging, their approach
does not guarantee semantic (chemical) validity, and neither does our method.


3 Method 


Our method is formulated in the framework of variational autoencoders (VAE) 
[9]. The main idea is to output a probabilistic fully-connected graph and use a 
graph matching algorithm to align it to the ground truth. We briefly recapitulate 
VAE below and continue with introducing our novel graph decoder together with 
an appropriate objective. 


3.1 Variational Autoencoder 


Let G = (A, E, F) be a graph specified with its adjacency matrix A, edge
attribute tensor E, and node attribute matrix F. We wish to learn an encoder
and a decoder to map between the space of graphs G and their continuous
embedding $z \in \mathbb{R}^c$, see Fig. 1. In the probabilistic setting of a
VAE, the encoder is defined by a variational posterior $q_\phi(z|G)$ and the
decoder by a generative distribution $p_\theta(G|z)$, where $\phi$ and $\theta$
are learned parameters. Furthermore, there is a prior distribution p(z) imposed
on the latent code representation as a regularization; we use a simplistic
isotropic Gaussian prior $p(z) = N(0, I)$. The whole model is trained by
minimizing the upper bound on negative log-likelihood $-\log p_\theta(G)$ [9]:

$$
\mathcal{L}(\phi, \theta; G) = E_{q_\phi(z|G)}[-\log p_\theta(G|z)] + KL[q_\phi(z|G) \,\|\, p(z)] \tag{1}
$$

The first term of $\mathcal{L}$, the reconstruction loss, enforces high
similarity of sampled generated graphs to the input graph G. The second term,
KL-divergence, regularizes the code space to allow for sampling of z directly
from p(z) instead of from $q_\phi(z|G)$ later. While the regularization is
independent of the input space, the reconstruction loss must be specifically
designed for each input modality.
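
As a side note on the second term of Eq. 1: if the posterior is a diagonal Gaussian with a mean $\mu_l$ and a standard deviation $\sigma_l$ per latent dimension (the kind of parameterization described for the encoder in Sect. 3.5; the symbols and the reading of $\sigma_l$ as a standard deviation are assumptions of this note), the KL-divergence against the prior $N(0, I)$ has a well-known closed form, so this term needs no sampling:

```latex
KL\left[ N(\mu, \mathrm{diag}(\sigma^2)) \,\|\, N(0, I) \right]
  = \frac{1}{2} \sum_{l=1}^{c} \left( \mu_l^2 + \sigma_l^2 - \log \sigma_l^2 - 1 \right)
```

The expression is zero exactly when $\mu_l = 0$ and $\sigma_l = 1$ for all $l$, i.e. when the posterior coincides with the prior, and grows as the posterior moves away from it.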


3.2 Probabilistic Graph Decoder 


In a related task of text sequence generation, the currently dominant approach 
is character-wise or word-wise prediction [2]. However, graphs can have arbi- 
trary connectivity and there is no clear way how to linearize their construction 
in a sequence of steps: Vinyals et al. [21] empirically found that the
linearization order matters when learning on sets. On the other hand, iterative
construction of discrete structures during training without step-wise supervision
involves discrete decisions, which are not differentiable and therefore problematic
for back-propagation.

Fortunately, the task can become much simpler if we restrict the domain to
the set of all graphs on at most k nodes, where k is fairly small (in practice
up to the order of tens). Under this assumption, handling dense graph
representations is still computationally tractable. We propose to make the
decoder output a probabilistic fully-connected graph
$\tilde{G} = (\tilde{A}, \tilde{E}, \tilde{F})$ on k nodes at once.
This effectively sidesteps both problems mentioned above.

In probabilistic graphs, the existence of nodes and edges is modeled as
Bernoulli variables, whereas node and edge attributes are multinomial variables.
While not discussed in this work, continuous attributes could be easily modeled
as Gaussian variables represented by their mean and variance. We assume all
variables to be independent.

Each tensor of the representation of $\tilde{G}$ thus has a probabilistic
interpretation. Specifically, the predicted adjacency matrix
$\tilde{A} \in [0,1]^{k \times k}$ contains both node probabilities
$\tilde{A}_{a,a}$ and edge probabilities $\tilde{A}_{a,b}$ for nodes $a \neq b$.
The edge attribute tensor $\tilde{E} \in \mathbb{R}^{k \times k \times d_e}$
indicates class probabilities for edges and, similarly, the node attribute
matrix $\tilde{F} \in \mathbb{R}^{k \times d_n}$ contains class probabilities
for nodes.

The decoder itself is deterministic. Its architecture is a simple multi-layer
perceptron (MLP) with three outputs in its last layer. Sigmoid activation
function is used to compute $\tilde{A}$, whereas edge- and node-wise softmax is
applied to obtain $\tilde{E}$ and $\tilde{F}$, respectively. At test time, we
are often interested in a (discrete) point estimate of $\tilde{G}$, which can be
obtained by taking edge- and node-wise argmax in $\tilde{A}$, $\tilde{E}$, and
$\tilde{F}$. Note that this can result in a discrete graph on fewer than k nodes.
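
To make the output activations and the test-time point estimate concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the authors' code: the function names and the toy logits are hypothetical, the argmax of a Bernoulli variable is implemented as a 0.5 threshold, the graph is treated as undirected, and edges are kept only between nodes that exist.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def discretize(a_logits, f_logits, e_logits):
    """Turn raw decoder outputs for k nodes into a discrete graph.

    Returns (nodes, edges): node index -> class, and (i, j) with i < j -> class.
    """
    k = len(a_logits)
    a = [[sigmoid(a_logits[i][j]) for j in range(k)] for i in range(k)]
    nodes = {}
    for i in range(k):
        if a[i][i] > 0.5:  # the diagonal holds node existence probabilities
            probs = softmax(f_logits[i])
            nodes[i] = probs.index(max(probs))
    edges = {}
    for i in nodes:
        for j in nodes:
            if i < j and a[i][j] > 0.5:  # off-diagonal entries hold edge probabilities
                probs = softmax(e_logits[i][j])
                edges[(i, j)] = probs.index(max(probs))
    return nodes, edges


# Toy logits for k = 3 nodes, 2 node classes and 3 edge classes.
k = 3
a_logits = [[2.0, 1.0, -3.0], [1.0, 3.0, -2.0], [-3.0, -2.0, -1.0]]
f_logits = [[0.1, 2.0], [1.5, 0.2], [0.0, 0.0]]
e_logits = [[[0.0, 0.0, 0.0] for _ in range(k)] for _ in range(k)]
e_logits[0][1] = [0.0, 0.3, 2.0]
nodes, edges = discretize(a_logits, f_logits, e_logits)
```

In this toy case the third node falls below the existence threshold, so the discrete graph has only two nodes although the decoder produced tensors for three.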


3.3 Reconstruction Loss 


Given a particular instance of a discrete input graph G on $n \le k$ nodes and
its probabilistic reconstruction $\tilde{G}$ on k nodes, evaluation of Eq. 1
requires computation of likelihood $p_\theta(G|z) = P(G|\tilde{G})$.

Since no particular ordering of nodes is imposed in either $\tilde{G}$ or G and
matrix representation of graphs is not invariant to permutations of nodes,
comparison of two graphs is hard. However, approximate graph matching described
further in Subsect. 3.4 can obtain a binary assignment matrix
$X \in \{0,1\}^{k \times n}$, where $X_{a,i} = 1$ only if node
$a \in \tilde{G}$ is assigned to $i \in G$ and $X_{a,i} = 0$ otherwise.

Knowledge of X allows mapping information between both graphs. Specifically,
the input adjacency matrix is mapped to the predicted graph as $A' = XAX^T$,
whereas the predicted node attribute matrix and slices of the edge attribute
tensor are transferred to the input graph as $\tilde{F}' = X^T\tilde{F}$ and
$\tilde{E}'_{\cdot,\cdot,l} = X^T \tilde{E}_{\cdot,\cdot,l} X$. The maximum
likelihood estimates, i.e. cross-entropy, of the respective variables are as
follows:

$$
\begin{aligned}
\log p(A'|z) ={}& \frac{1}{k} \sum_{a} \left[ A'_{a,a} \log \tilde{A}_{a,a} + (1 - A'_{a,a}) \log(1 - \tilde{A}_{a,a}) \right] \\
&+ \frac{1}{k(k-1)} \sum_{a \neq b} \left[ A'_{a,b} \log \tilde{A}_{a,b} + (1 - A'_{a,b}) \log(1 - \tilde{A}_{a,b}) \right] \\
\log p(F|z) ={}& \frac{1}{n} \sum_{i} \log F_{i,\cdot}^T \tilde{F}'_{i,\cdot} \\
\log p(E|z) ={}& \frac{1}{\|A\|_1 - n} \sum_{i \neq j} \log E_{i,j,\cdot}^T \tilde{E}'_{i,j,\cdot}
\end{aligned}
\tag{2}
$$

where we assumed that F and E are encoded in one-hot notation. The formulation
considers existence of both matched and unmatched nodes and edges but
attributes of only the matched ones. Furthermore, averaging over nodes and
edges separately has proven beneficial in training, as otherwise the edges
dominate the likelihood. The overall reconstruction loss is a weighted sum of
the previous terms:

$$
-\log p(G|z) = -\lambda_A \log p(A'|z) - \lambda_F \log p(F|z) - \lambda_E \log p(E|z) \tag{3}
$$
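
The three log-likelihood terms and their weighted sum can be sketched in pure Python, assuming the graphs have already been aligned by the assignment matrix. This is an illustration, not the authors' implementation: all function and argument names are hypothetical, the small constant `EPS` is a numerical guard added for the sketch, the input adjacency matrix is assumed to carry ones on its diagonal, and the edge-attribute sum is restricted to edges that exist in the input graph.

```python
import math

EPS = 1e-12  # numerical guard against log(0); an assumption of this sketch


def log_p_adjacency(a_mapped, a_pred):
    """a_mapped: k x k binary input adjacency mapped to the prediction; a_pred: k x k probabilities."""
    k = len(a_pred)

    def bce(t, p):
        return t * math.log(p + EPS) + (1 - t) * math.log(1 - p + EPS)

    node_term = sum(bce(a_mapped[a][a], a_pred[a][a]) for a in range(k)) / k
    edge_term = sum(bce(a_mapped[a][b], a_pred[a][b])
                    for a in range(k) for b in range(k) if a != b) / (k * (k - 1))
    return node_term + edge_term


def log_p_node_attrs(f_true, f_pred_mapped):
    """f_true: n x d one-hot rows; f_pred_mapped: n x d class probabilities."""
    n = len(f_true)
    return sum(math.log(sum(t * p for t, p in zip(f_true[i], f_pred_mapped[i])) + EPS)
               for i in range(n)) / n


def log_p_edge_attrs(a_true, e_true, e_pred_mapped):
    """Average over existing input edges only (an assumption of this sketch)."""
    n = len(a_true)
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j and a_true[i][j] == 1]
    if not pairs:
        return 0.0
    total = sum(math.log(sum(t * p for t, p in zip(e_true[i][j], e_pred_mapped[i][j])) + EPS)
                for i, j in pairs)
    return total / len(pairs)


def reconstruction_loss(a_mapped, a_pred, f_true, f_pred_mapped,
                        a_true, e_true, e_pred_mapped, lambdas=(1.0, 1.0, 1.0)):
    lam_a, lam_f, lam_e = lambdas
    return -(lam_a * log_p_adjacency(a_mapped, a_pred)
             + lam_f * log_p_node_attrs(f_true, f_pred_mapped)
             + lam_e * log_p_edge_attrs(a_true, e_true, e_pred_mapped))


# Toy example: input graph with n = 2 connected nodes, prediction on k = 3 nodes.
a_true = [[1, 1], [1, 1]]
a_mapped = [[1, 1, 0], [1, 1, 0], [0, 0, 0]]
f_true = [[1, 0], [0, 1]]
e_true = [[[0, 0], [0, 1]], [[0, 1], [0, 0]]]
perfect = reconstruction_loss(a_mapped, [[float(v) for v in row] for row in a_mapped],
                              f_true, f_true, a_true, e_true, e_true)
uniform = reconstruction_loss(a_mapped, [[0.5] * 3 for _ in range(3)],
                              f_true, [[0.5, 0.5]] * 2, a_true, e_true,
                              [[[0.5, 0.5]] * 2 for _ in range(2)])
```

A perfect prediction gives a loss of approximately zero, while the uninformative prediction with all probabilities at 0.5 gives a clearly larger value.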


3.4 Graph Matching 


The goal of (second-order) graph matching is to find correspondences
$X \in \{0,1\}^{k \times n}$ between nodes of graphs G and $\tilde{G}$ based on
the similarities of their node pairs $S : (i,j) \times (a,b) \to \mathbb{R}^+$
for $i,j \in G$ and $a,b \in \tilde{G}$. It can be expressed as an integer
quadratic programming problem of similarity maximization over X and is
typically approximated by relaxation of X into the continuous domain:
$X^* \in [0,1]^{k \times n}$ [4]. For our use case, the similarity function is
defined as follows:

$$
\begin{aligned}
S((i,j),(a,b)) ={}& (E_{i,j,\cdot}^T \tilde{E}_{a,b,\cdot}) A_{i,j} \tilde{A}_{a,b} \tilde{A}_{a,a} \tilde{A}_{b,b} [i \neq j \wedge a \neq b] \\
&+ (F_{i,\cdot}^T \tilde{F}_{a,\cdot}) \tilde{A}_{a,a} [i = j \wedge a = b]
\end{aligned}
\tag{4}
$$


The first term evaluates similarity between edge pairs and the second term
between node pairs, [·] being the Iverson bracket. Note that the scores
consider both feature compatibility ($\tilde{F}$ and $\tilde{E}$) and
existential compatibility ($\tilde{A}$),
which has empirically led to more stable assignments during training. To sum- 
marize the motivation behind both Eqs. 3 and 4, our method aims to find the 
best graph matching and then further improve on it by gradient descent on the 
loss. Given the stochastic way of training deep networks, we argue that solving 
the matching step only approximately is sufficient. This is conceptually similar 
to the approach for learning to output unordered sets [21], where the closest 
ordering of the training data is sought. 

In practice, we are looking for a graph matching algorithm robust to noisy 
correspondences which can be easily implemented on GPU in batch mode.
Max-pooling matching (MPM) by [4] is a simple but effective algorithm following
the iterative scheme of power methods. It can be used in batch mode if
similarity tensors are zero-padded, i.e. $S((i,j),(a,b)) = 0$ for
$n < i,j \le k$, and the number of iterations is fixed.

Max-pooling matching outputs a continuous assignment matrix $X^*$.
Unfortunately, attempts to directly use $X^*$ instead of X in Eq. 3 performed
badly, as did experiments with direct maximization of $X^*$ or soft
discretization with softmax or straight-through Gumbel softmax [7]. We
therefore discretize $X^*$ to X using the Hungarian algorithm to obtain a strict
one-on-one mapping. While this operation is non-differentiable, gradient can
still flow to the decoder directly through the loss function and training
convergence proceeds without problems. Note that this approach is often taken
in works on object detection, e.g. [19], where a set of detections needs to be
matched to a set of ground truth bounding boxes and treated as fixed before
computing a differentiable loss.
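
The discretization step can be illustrated with a small sketch that turns a soft assignment matrix into a binary one-to-one mapping. To stay dependency-free, it uses a brute-force search over all injective assignments instead of the Hungarian algorithm; both return a maximum-score one-to-one mapping, but the brute force is only feasible for very small k. The function name and the toy matrix are hypothetical.

```python
from itertools import permutations


def discretize_assignment(x_star):
    """Convert a k x n matrix of soft scores into a binary k x n assignment.

    Each of the n input nodes (columns) is matched to a distinct predicted
    node (row) so that the total score is maximal. Brute force stands in for
    the Hungarian algorithm here.
    """
    k, n = len(x_star), len(x_star[0])
    best_score, best_rows = float("-inf"), None
    # rows[i] is the predicted node matched to input node i.
    for rows in permutations(range(k), n):
        score = sum(x_star[rows[i]][i] for i in range(n))
        if score > best_score:
            best_score, best_rows = score, rows
    x = [[0] * n for _ in range(k)]
    for i, a in enumerate(best_rows):
        x[a][i] = 1
    return x


# Both input nodes prefer predicted node 0, so a greedy choice would conflict.
x_star = [[0.9, 0.8],
          [0.7, 0.1],
          [0.2, 0.3]]
x = discretize_assignment(x_star)
```

For realistic sizes, an off-the-shelf solver such as SciPy's `linear_sum_assignment` (with a negated or `maximize=True` score matrix) would be the usual choice; the brute force above is only meant to show what the discretization computes.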


3.5 Further Details 


Encoder. A feed forward network with edge-conditioned graph convolutions
(ECC) [17] is used as encoder, although any other graph embedding method
is applicable. As our edge attributes are categorical, a single linear layer for
the filter generating network in ECC is sufficient. As usual in VAE, we
formulate the encoder as probabilistic and enforce Gaussian distribution of
$q_\phi(z|G)$ by having the last encoder layer output 2c features interpreted
as mean and variance, allowing to sample
$z_l \sim N(\mu_l(G), \sigma_l(G))$ for $l \in 1, .., c$ using the
re-parameterization trick [9].
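
The sampling step can be sketched as follows. This is a generic illustration of the re-parameterization trick, not the authors' code: the function names are hypothetical, and interpreting the second half of the 2c features as a log-variance (rather than a variance) is a common convention assumed here for numerical convenience.

```python
import math
import random


def split_encoder_output(features):
    """Split 2c encoder features into c means and c log-variances (assumed layout)."""
    c = len(features) // 2
    return features[:c], features[c:]


def sample_latent(mean, log_var, rng=random):
    """Draw z_l = mu_l + sigma_l * eps_l with eps_l ~ N(0, 1).

    The randomness is isolated in eps, so z stays a differentiable function
    of the mean and the log-variance.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


# Toy encoder output for c = 2; a very negative log-variance means almost no noise.
mean, log_var = split_encoder_output([1.0, -2.0, -100.0, -100.0])
z = sample_latent(mean, log_var, random.Random(0))
```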


Disentangled Embedding. In practice, rather than random drawing of graphs, 
one often desires more control over generated graphs. In such a case, we follow
[18] and condition both encoder and decoder on label vector y associated with
each input graph G. Decoder $p_\theta(G|z, y)$ is fed a concatenation of z and
y, while in encoder $q_\phi(z|G, y)$, y is concatenated to every node’s
features just before the graph pooling layer. If the size of latent space c is
small, the decoder is encouraged to exploit information in the label.


Limitations. The proposed model is expected to be useful only for generating 
small graphs. This is due to growth of GPU memory requirements and number of
parameters ($O(k^2)$) as well as matching complexity ($O(k^4)$), with small decrease
in quality for high values of k. In Sect. 4 we demonstrate results for up to k = 38.
Nevertheless, for many applications even generation of small graphs is still very 
useful. 


4 Evaluation 


We demonstrate our method for the task of molecule generation by evaluating
on two large public datasets of organic molecules, QM9 and ZINC.


4.1 Application in Cheminformatics


Quantitative evaluation of generative models of images and texts has been
troublesome [20], as it is very difficult to measure the realness of generated
samples in an automated and objective way. Thus, researchers frequently resort there to
qualitative evaluation and embedding plots. However, qualitative evaluation of 
graphs can be very unintuitive for humans to judge unless the graphs are planar 
and fairly simple. 

Fortunately, we found graph representation of molecules, as undirected 
graphs with atoms as nodes and bonds as edges, to be a convenient testbed 
for generative models. On one hand, generated graphs can be easily visualized 
in standardized structural diagrams. On the other hand, chemical validity of 
graphs, as well as many further properties a molecule can fulfill, can be checked 
using software packages (SanitizeMol in RDKit [12]) or simulations. This makes 
both qualitative and quantitative tests possible. 

Chemical constraints on compatible types of bonds and atom valences make 
the space of valid graphs complicated and molecule generation challenging. In 
fact, a single addition or removal of edge or change in atom or bond type 
can make a molecule chemically invalid. Comparably, flipping a single pixel in 
MNIST-like number generation problem is of no issue. 

To help the network in this application, we introduce three remedies. First,
we make the decoder output symmetric $\tilde{A}$ and $\tilde{E}$ by predicting
their (upper) triangular parts only, as undirected graphs are a sufficient
representation for molecules. Second, we use prior knowledge that molecules are
connected and, at test time only, construct a maximum spanning tree on the set
of probable nodes $\{a : \tilde{A}_{a,a} > 0.5\}$ in order to include its edges
(a, b) in the discrete pointwise estimate of the graph even if
$\tilde{A}_{a,b} < 0.5$ originally. Third, we do not generate hydrogen
explicitly and let it be added as “padding” during the chemical validity
check.
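
The second remedy can be sketched with a small Kruskal-style routine. This is an illustration under assumptions rather than the authors' implementation: the function name and the toy matrix are hypothetical, the predicted adjacency matrix is assumed symmetric, and edge probabilities are used directly as spanning-tree weights.

```python
def connect_probable_nodes(a_pred, threshold=0.5):
    """Return probable nodes and an edge set that keeps them connected.

    a_pred is a symmetric k x k matrix: node probabilities on the diagonal,
    edge probabilities off the diagonal.
    """
    k = len(a_pred)
    nodes = [a for a in range(k) if a_pred[a][a] > threshold]
    parent = {a: a for a in nodes}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Kruskal on descending probabilities yields a maximum spanning tree.
    weighted = sorted(((a_pred[a][b], a, b) for a in nodes for b in nodes if a < b),
                      reverse=True)
    tree_edges = set()
    for _, a, b in weighted:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            tree_edges.add((a, b))

    likely_edges = {(a, b) for a in nodes for b in nodes
                    if a < b and a_pred[a][b] > threshold}
    return nodes, likely_edges | tree_edges


# Toy prediction: node 2 is probable but has no edge above the threshold.
a_pred = [[0.9, 0.8, 0.3, 0.1],
          [0.8, 0.9, 0.2, 0.1],
          [0.3, 0.2, 0.7, 0.1],
          [0.1, 0.1, 0.1, 0.2]]
nodes, edges = connect_probable_nodes(a_pred)
```

In the toy case, thresholding alone would leave node 2 isolated; the spanning tree adds its most probable edge so that the resulting graph is connected.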


4.2 QM9 Dataset 


QM9 dataset [15] contains about 134k organic molecules of up to 9 heavy
(non-hydrogen) atoms with 4 distinct atomic numbers and 4 bond types; we set
k = 9, $d_e = 4$ and $d_n = 4$. We set aside 10k samples for testing and 10k
for validation (model selection).

We compare our unconditional model to the character-based generator of 
Gómez-Bombarelli et al. [5] (CVAE) and the grammar-based generator of Kusner
et al. [11] (GVAE). We used the code and architecture in [11] for both baselines, 
adapting the maximum input length to the smallest possible. In addition, we 
demonstrate a conditional generative model for an artificial task of generating 
molecules given a histogram of heavy atoms as 4-dimensional label y, the success 
of which can be easily validated. 


Setup. The encoder has two graph convolutional layers (32 and 64 channels) with
identity connection, batchnorm, and ReLU; followed by the graph-level output
formulation in Eq. 7 in [13] with auxiliary networks being a single fully connected
layer (FCL) with 128 output channels; finalized by a FCL outputting $(\mu, \sigma)$.
The decoder has 3 FCLs (128, 256, and 512 channels) with batchnorm and ReLU;
followed by a parallel triplet of FCLs to output graph tensors. We set c = 40,
$\lambda_A = \lambda_F = \lambda_E = 1$, batch size 32, 75 MPM iterations and
train for 25 epochs with Adam with learning rate 1e-3 and $\beta_1 = 0.5$.


Embedding Visualization. To visually judge the quality and smoothness of the
learned embedding z of our model, we may traverse it in two ways: along a
slice and along a line. For the former, we randomly choose two c-dimensional
orthonormal vectors and sample z in a regular grid pattern over the induced 2D
plane. Figure 2 shows a varied and fairly smooth mix of molecules (for the
unconditional model with c = 40 and within 5 units from the origin). For the
latter, we randomly choose two molecules G^(1), G^(2) of the same label from the
test set and interpolate between their embeddings μ(G^(1)), μ(G^(2)). This also
evaluates the encoder, and therefore benefits from low reconstruction error. In
Fig. 3 we can find both meaningful (1st, 2nd and 4th row) and less meaningful
transitions, though many samples on the lines do not form chemically valid
compounds.
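
The two traversals can be illustrated with a short sketch. It is only an
illustration under assumptions: the helper names are hypothetical, the grid
half-width `radius` and the number of `steps` are free parameters chosen here,
and the orthonormal pair is obtained by Gram-Schmidt on two Gaussian vectors,
which is one common way to draw such a pair.

```python
import math
import random

def random_orthonormal_pair(c, rng):
    """Draw two random orthonormal c-dimensional vectors via Gram-Schmidt."""
    u = [rng.gauss(0.0, 1.0) for _ in range(c)]
    v = [rng.gauss(0.0, 1.0) for _ in range(c)]
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    # Remove the component of v along u, then normalize.
    dot = sum(a * b for a, b in zip(u, v))
    v = [b - dot * a for a, b in zip(u, v)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    return u, v

def slice_grid(u, v, radius=5.0, steps=5):
    """Latent points s*u + t*v on a regular grid over the plane spanned by u, v."""
    ticks = [-radius + 2.0 * radius * i / (steps - 1) for i in range(steps)]
    return [[[s * a + t * b for a, b in zip(u, v)] for s in ticks] for t in ticks]

def interpolate(mu1, mu2, steps=5):
    """Latent points on the line between two embeddings mu1 and mu2."""
    weights = [i / (steps - 1) for i in range(steps)]
    return [[(1.0 - w) * a + w * b for a, b in zip(mu1, mu2)] for w in weights]
```

Each returned latent point would then be passed to the decoder to obtain the
molecule shown at that grid cell or interpolation step.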


Decoder Quality Metrics. The quality of a conditional decoder can be evaluated
by the validity and variety of generated graphs. For a given label y^(l), we draw
n_s = 10^4 samples z^(l,s) ~ p(z) and compute the discrete point estimate of their
decodings Ĝ^(l,s) = arg max p_θ(G | z^(l,s), y^(l)).

Let V^(l) be the list of chemically valid molecules from Ĝ^(l,s) and C^(l) be
the list of chemically valid molecules with atom histograms equal to y^(l). We
are interested in the ratios Valid^(l) = |V^(l)|/n_s and Accurate^(l) = |C^(l)|/n_s.
Furthermore, let Unique^(l) = |set(C^(l))|/|C^(l)| be the fraction of unique correct
graphs and Novel^(l) = 1 − |set(C^(l)) ∩ QM9|/|set(C^(l))| the fraction of novel
out-of-dataset graphs; we define Unique^(l) = 0 and Novel^(l) = 0 if |C^(l)| = 0.
Finally, the introduced metrics are aggregated by frequencies of labels in QM9,
e.g. Valid = Σ_l Valid^(l) freq(y^(l)). Unconditional decoders are evaluated by
assuming there is just a single label, therefore Valid = Accurate.
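
The metric definitions above translate almost directly into code. The sketch
below is illustrative only: molecules are assumed to be hashable canonical
identifiers (for example canonical strings), and `is_valid` and `histogram`
are hypothetical callbacks standing in for a chemical validity check and an
atom-histogram extractor.

```python
def decoder_metrics(samples, label, dataset, is_valid, histogram):
    """Per-label Valid, Accurate, Unique and Novel ratios.

    samples: decoded graphs for one label (hashable identifiers, assumed).
    dataset: set of identifiers of the training molecules.
    """
    n_s = len(samples)
    valid = [g for g in samples if is_valid(g)]              # list V
    correct = [g for g in valid if histogram(g) == label]    # list C
    unique = set(correct)
    if correct:
        unique_ratio = len(unique) / len(correct)
        novel_ratio = 1.0 - len(unique & dataset) / len(unique)
    else:
        # Both ratios are defined as 0 when C is empty.
        unique_ratio, novel_ratio = 0.0, 0.0
    return {
        "Valid": len(valid) / n_s,
        "Accurate": len(correct) / n_s,
        "Unique": unique_ratio,
        "Novel": novel_ratio,
    }

def aggregate(per_label, freq):
    """Weight per-label values by label frequency (frequencies sum to 1)."""
    return sum(per_label[y] * freq[y] for y in per_label)
```

For an unconditional decoder, `histogram` can return a constant equal to
`label`, so that every valid sample counts as accurate.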

In Table 1, we can see that on average 50% of generated molecules are chem- 
ically valid and, in the case of conditional models, about 40% have the correct 
label which the decoder was conditioned on. Larger embedding sizes c are less 
regularized, demonstrated by a higher number of Unique samples and by lower 
accuracy of the conditional model, as the decoder is forced less to rely on actual 
labels. The ratio of Valid samples shows less clear behavior, likely because the 
discrete performance is not directly optimized for. For all models, it is remarkable 
that about 60% of generated molecules are out of the dataset, i.e. the network 
has never seen them during training. 

Looking at the baselines, CVAE can output only very few valid samples, as
expected, while GVAE generates the highest number of valid samples (60%) but
of very low variety (less than 10% are unique). Additionally, we investigate the
importance of graph matching by using the identity assignment X instead and thus
learning to reproduce particular node permutations in the training set, which
correspond to the canonical ordering of SMILES strings from RDKit. This ablated
model (denoted as NoGM in Table 1) produces many valid samples of lower variety
and, surprisingly, outperforms GVAE in this regard. In comparison, our model
can achieve good performance in both metrics at the same time.


Likelihood. Besides the application-specific metrics introduced above, we also
report the evidence lower bound (ELBO) commonly used in the VAE literature, which
corresponds to −L(φ, θ; G) in our notation. In Table 1, we state mean bounds
over the test set, using a single z sample per graph. We observe that both the
reconstruction loss and the KL-divergence decrease with larger c, which provides
more freedom. However, there seems to be no strong correlation between ELBO and
Valid, which makes model selection somewhat difficult.


4.3 ZINC Dataset 


The ZINC dataset [6] contains about 250k drug-like organic molecules of up to 38
heavy atoms with 9 distinct atomic numbers and 4 bond types, so we set k = 38,
d_e = 4 and d_n = 9 and use the same split strategy as with QM9. We investigate
the degree of scalability of an unconditional generative model. The setup is
the same as for QM9 but with a wider encoder (64, 128, 256 channels).

Table 1. Performance on conditional and unconditional QM9 models evaluated by
mean test-time reconstruction log-likelihood (log p_θ(G|z)), mean test-time evidence
lower bound (ELBO), and decoding quality metrics (Sect. 4.2). Baselines CVAE [5]
and GVAE [11] are listed only for the embedding size with the highest Valid.

                        log p_θ(G|z)   ELBO     Valid   Accurate   Unique   Novel
Cond.      Ours c = 20  -0.578         -0.722   0.565   0.467      0.314    0.598
           Ours c = 40  -0.504         -0.617   0.511   0.416      0.484    0.635
           Ours c = 60  -0.492         -0.585   0.520   0.406      0.583    0.613
           Ours c = 80  -0.475         -0.557   0.458   0.353      0.666    0.661
Uncond.    Ours c = 20  -0.660         -0.916   0.485   0.485      0.457    0.575
           Ours c = 40  -0.537         -0.744   0.542   0.542      0.618    0.617
           Ours c = 60  -0.486         -0.656   0.517   0.517      0.695    0.570
           Ours c = 80  -0.482         -0.628   0.557   0.557      0.760    0.616
           NoGM c = 80  -2.388         -2.553   0.810   0.810      0.241    0.610
           CVAE c = 60  -              -        0.103   0.103      0.675    0.900
           GVAE c = 20  -              -        0.602   0.602      0.093    0.809


Our best model with c = 40 achieved Valid = 0.135, which is clearly
worse than for QM9. For comparison, CVAE failed to generate any valid sample,
while GVAE achieved Valid = 0.357 (models provided by [11], c = 56). We
attribute such low performance to a generally much higher chance of producing
a chemically relevant inconsistency (the number of possible edges grows
quadratically). To confirm the relationship between performance and graph size k,
we kept only graphs not larger than k = 20 nodes, corresponding to 21% of ZINC,
and obtained Valid = 0.341 (and Valid = 0.185 for k = 30 nodes, 92% of ZINC).
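
The quadratic growth mentioned above can be made concrete with a small
calculation. The helper name is hypothetical; it simply counts the unordered
node pairs of an undirected graph with k nodes, each of which is a place where
the decoder can introduce an inconsistency.

```python
def possible_edges(k):
    """Number of unordered node pairs in an undirected graph with k nodes."""
    return k * (k - 1) // 2

# Graph sizes used in the QM9 and ZINC experiments.
for k in (9, 20, 30, 38):
    print(k, possible_edges(k))
# prints: 9 36, 20 190, 30 435, 38 703 (one pair per line)
```

Going from k = 9 (QM9) to k = 38 (ZINC) thus increases the number of candidate
edges by roughly a factor of 20.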


5 Conclusion 


In this work we addressed the problem of generating graphs from a continuous
embedding in the context of variational autoencoders. We evaluated our method
on two molecular datasets of different maximum graph size. While we managed
to learn an embedding of reasonable quality on small molecules, our decoder had a
hard time capturing complex chemical interactions for larger molecules.
Nevertheless, we believe our method is an important initial step towards more
powerful decoders and will spark interest in the community.


Acknowledgments. We thank Shell Xu Hu for discussions on variational methods,
Shinjae Yoo for project motivation, and anonymous reviewers for their comments.


Abstract. Many variants of the sampling-based motion planning algorithm
Rapidly-exploring Random Tree (RRT) use biased sampling for faster
convergence. One such recently proposed variant, the Hybrid Augmented
CL-RRT+, uses a predefined template trajectory predicted by a machine
learning algorithm as a reference for the biased sampling. Because of the
finite number of template trajectories, the convergence time is short only
in scenarios where the final trajectory is close to the predicted template
trajectory. Therefore, a generative model using a variational autoencoder
for generating many reference trajectories and a 3D-ConvNet regressor for
predicting those reference trajectories for critical vehicle traffic-scenarios
are proposed in this work. Using this framework, two different safe
trajectory planning algorithms, namely GATE and GATE-ARRT+, are presented
in this paper. Finally, the simulation results demonstrate the effectiveness
of these algorithms for the trajectory planning task in different types of
critical vehicle traffic-scenarios.


Keywords: Safe trajectory planning · Hybrid machine learning ·
Variational autoencoder


1 Introduction 


Autonomous driving is one of the areas being extensively researched, in both
academia and industry, because of its expected immense social and economic
impact. In order to realize fully autonomous driving, the vehicle must be able
to plan a trajectory with simultaneous intervention in the lateral and
longitudinal dynamics of the vehicle for collision avoidance/mitigation in
critical, dynamic traffic-scenarios as well as for smooth and comfortable traveling.

Many motion planning algorithms have been proposed in the literature, as
summarized in [1]. The probabilistic sampling algorithm ‘Rapidly-exploring Random
Tree’ (RRT) [2] is the most popular because of its fast runtimes and ability to plan
a path with dynamic constraints without discretizing the state-space. Many
variants of this algorithm have been developed for different applications, as
summarized in [3]. Only a few of these variants [4,5] claim to run in real time with
dynamic constraints. However, they require either precomputation of many safe
states or high-performance computers.

RRT is a probabilistically complete algorithm, i.e., the probability that it finds
a solution, if one exists, approaches one as the running time goes to infinity.
Therefore, many approaches define rule-based heuristics for biased-sampling [7-12]
to increase the convergence rate. Nevertheless, all of these methods require an
initial approximate solution for biased-sampling.

Machine learning algorithms can be used to find solutions for complex problems
with short inference time. Since they are purely data-based methods, they
are seen as black-box methods. Therefore, they are not used in safety-critical
applications like vehicle trajectory planning. In [13], a learned Gaussian mixture
model distribution is used for biased-sampling in learned free spaces to
drastically decrease the number of collision checks for trajectory planning
with the RRT algorithm. In another approach [15], a conditional variational
autoencoder is used to generate biased samples in space from a learned sampling
distribution to increase the convergence rate of the RRT algorithm. However,
both approaches perform biased-sampling in space only.

The use of hybrid machine learning algorithms, a combination of machine
learning algorithms and model-based search algorithms, opens a new way of
using machine learning algorithms in safety-critical applications. AlphaGo [16]
and ExIT [17] are two examples of guided tree search algorithms with neural
networks for the board games Go and Hex, respectively. However, these algorithms
are limited to discrete state-spaces and action-spaces. The Hybrid Augmented
CL-RRT (HARRT) [14] and the Hybrid Augmented CL-RRT+ (HARRT+) [18]
are examples of hybrid machine learning algorithms for safe trajectory planning
in complex, critical traffic scenarios which use 3D convolutional neural networks
(3D-ConvNets) [19] in combination with the RRT variants Augmented CL-RRT
(ARRT) [6] and Augmented CL-RRT+ (ARRT+) [18], respectively. The
HARRT+ algorithm is described briefly in Sect. 3 along with its drawbacks.

Other approaches to developing generative models for trajectories have been
proposed for different applications such as handwriting generation [21] and
predicting basketball trajectories [22]. In this work, a methodology for
generating better reference trajectories with two machine learning algorithms is
proposed. The first is a generative model for trajectory generation using a
variational autoencoder (VAE) [23], and the second is a 3D-ConvNet regressor for
predicting those reference trajectories for critical traffic-scenarios. Two different
motion planning algorithms are also presented by combining this machine learning
framework with an optimization procedure and the ARRT+ algorithm to
decrease the convergence time further.

The paper is organised as follows: Sect. 2 briefly explains VAEs. Section 3
describes the HARRT+ motion planning algorithm with its drawbacks. Two machine
learning algorithms, a generative model for trajectories and a 3D-ConvNet
regressor, are presented in Sect. 4. Based on this framework, two new vehicle motion
planning algorithms, namely GATE and GATE-ARRT+, are proposed in Sect. 5,
followed by results and a conclusion.

Throughout this paper, upper case bold letters denote matrices and lower 
case bold letters denote vectors. 


2 Variational Autoencoder 


This section briefly reviews the variational autoencoder (VAE) [23], which is the key
mechanism used for developing a generative model for vehicle trajectories. It tries
to minimize the difference between the model distribution P_θ(X) with parameters
θ and the data distribution P_data(X), given a data-set X = {x_i}_{i=1}^N of N
independent and identically distributed samples of some discrete or continuous
random variable x. It assumes that this data-set is generated by a two-step random
process using a latent variable z. First, a realization of z is sampled from a prior
distribution P_θ(z). Then, X is generated from a conditional distribution P_θ(X|z).
The goal is to maximize the probability of observing the realizations X according to


$$P_\theta(X) = \int P_\theta(X|z)\,P_\theta(z)\,dz. \tag{1}$$


The problem with the above equation is that it is intractable, as it is impossible to
evaluate P_θ(X|z) for every z. The posterior distribution P_θ(z|X) is also intractable.
The VAE proposes a solution for this by defining an encoder model Q_φ(z|X) that
approximates P_θ(z|X). As X is fixed and P_θ(X) is not dependent on Q_φ(z|X),
the log likelihood of the data can be found by taking the expectation with respect
to z using an encoder network Q_φ(z|X) such that


$$\log P_\theta(X) = \mathbb{E}_{z \sim Q_\phi(z|X)}\left[\log P_\theta(X)\right]. \tag{2}$$


Applying Bayes’ rule to Eq. 2, the equation becomes


$$\log P_\theta(X) = \mathbb{E}_{z|X}\left[\log \frac{P_\theta(X|z)\,P_\theta(z)}{P_\theta(z|X)}\right], \tag{3}$$


where E_{z∼Q_φ(z|X)} is replaced by E_{z|X} to avoid clutter. Multiplying and
dividing by Q_φ(z|X) and applying logarithmic rules, we get


$$\log P_\theta(X) = \mathbb{E}_{z|X}\left[\log P_\theta(X|z)\right] - \mathbb{E}_{z|X}\left[\log \frac{Q_\phi(z|X)}{P_\theta(z)}\right] + \mathbb{E}_{z|X}\left[\log \frac{Q_\phi(z|X)}{P_\theta(z|X)}\right]. \tag{4}$$


Writing the above equation with KL terms, the log data likelihood becomes


$$\log P_\theta(X) = \underbrace{\mathbb{E}_{z|X}\left[\log P_\theta(X|z)\right] - D_{KL}\left(Q_\phi(z|X)\,\|\,P_\theta(z)\right)}_{\mathcal{L}(X,\theta,\phi)} + \underbrace{D_{KL}\left(Q_\phi(z|X)\,\|\,P_\theta(z|X)\right)}_{\geq 0}. \tag{5}$$


The first term on the right-hand side of Eq. 5 can be estimated by
the decoder network through sampling. This sampling procedure, which is not
differentiable by itself, is made differentiable through the reparameterization
technique [23], as required for backpropagation. Generally, a cross-entropy or root
mean square error criterion is used for calculating the reconstruction loss. The
second term in Eq. 5, the KL divergence between the approximate posterior and
the prior distribution, can be computed analytically. This is because the
approximate posterior Q_φ(z|X) is often chosen as a multivariate Gaussian with
diagonal covariance matrix whose distribution parameters are learnt from the
data, while the prior P_θ(z) is commonly chosen as an isotropic multivariate
Gaussian. The third term in Eq. 5 is intractable as P_θ(z|X) is intractable.
However, by the definition of the KL divergence, it is always greater than or
equal to 0. The first two terms together are termed the variational
lower bound L(X, θ, φ), and the goal becomes to maximize the lower bound to
find the optimal θ* and φ* such that


$$\theta^*, \phi^* = \arg\max_{\theta,\phi}\, \mathcal{L}(X,\theta,\phi). \tag{6}$$
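
As an illustration of the two computable terms in Eq. 5, the sketch below
evaluates the lower bound for a single sample under the choices named in the
text: a diagonal-Gaussian approximate posterior and a standard normal prior,
for which the KL term has the well-known closed form
0.5 Σ (σ² + μ² − 1 − log σ²). The function names and the scalar argument
`recon_log_lik`, which stands in for a decoder's estimate of log P_θ(X|z), are
assumptions made for this example.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))

def lower_bound(recon_log_lik, mu, sigma):
    """Single-sample estimate of the variational lower bound in Eq. 5.

    recon_log_lik is an assumed, externally computed estimate of
    log P(X|z) for one sample z drawn from the approximate posterior.
    """
    return recon_log_lik - kl_to_standard_normal(mu, sigma)
```

When the posterior equals the prior (zero mean, unit standard deviation), the
KL term vanishes and the bound reduces to the reconstruction term.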


3 HARRT+ Algorithm 


The ARRT+ algorithm considers the nonlinear vehicle dynamics for trajectory
planning in the form


$$\dot{s}(t) = f(s(t), u(t)), \tag{7}$$


where u(t) ∈ R^m is the control input and s(t) is the area occupied by the
EGO vehicle at time t, which is a subspace of R^2. In an iterative process,
this algorithm constructs a tree T with multiple safe states s(t). Throughout
this paper, the term safe, used in the context of states and trajectories, means
either collision-free or with a predicted non-severe collision. In every iteration, a
random point s_rand is sampled with some bias towards a goal region S_goal. The
state s_nearest(t) that is nearest to s_rand among the states previously stored in
the tree T is found. The tree is extended by an incremental motion towards s_rand
from s_nearest(t). The incremental extension is performed for the time interval Δt
using the differential constraints f as in Eq. (7) to get the new state s_new(t + Δt).
The new state s_new(t + Δt) is added to the tree T if the trajectory from
s_nearest(t) to s_new(t + Δt) is collision-free or encounters a collision with
predicted low severity. A two-track model [20] is used as the constraint f while
extending the tree.
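
The sample, nearest, extend and check loop described above can be sketched in a
few lines. This is a deliberately simplified illustration and not the ARRT+
algorithm itself: states are assumed to be 2D points, straight-line steering
with a fixed step replaces the two-track model f, and the hypothetical
`is_safe` callback stands in for the collision and severity checks.

```python
import math
import random

def sample(goal, bounds, goal_bias, rng):
    """Draw a random point, returning the goal with probability goal_bias."""
    if rng.random() < goal_bias:
        return goal
    (xmin, xmax), (ymin, ymax) = bounds
    return (rng.uniform(xmin, xmax), rng.uniform(ymin, ymax))

def extend_tree(tree, s_rand, step, is_safe):
    """One iteration: steer from the nearest stored state towards s_rand."""
    s_near = min(tree, key=lambda s: math.dist(s, s_rand))
    d = math.dist(s_near, s_rand)
    if d == 0.0:
        return None
    scale = min(step, d) / d
    s_new = (s_near[0] + scale * (s_rand[0] - s_near[0]),
             s_near[1] + scale * (s_rand[1] - s_near[1]))
    if is_safe(s_near, s_new):
        tree[s_new] = s_near  # remember the parent to recover the path later
        return s_new
    return None

def grow(start, goal, bounds, is_safe, step=1.0, goal_bias=0.1,
         max_samples=300, goal_radius=1.0, seed=0):
    """Grow the tree until a state within goal_radius of the goal is added."""
    rng = random.Random(seed)
    tree = {start: None}
    for _ in range(max_samples):
        s_rand = sample(goal, bounds, goal_bias, rng)
        s_new = extend_tree(tree, s_rand, step, is_safe)
        if s_new is not None and math.dist(s_new, goal) <= goal_radius:
            return tree, s_new
    return tree, None
```

Raising `goal_bias` corresponds to stronger biased sampling: fewer samples are
spent exploring, at the cost of slower progress when the direct route is blocked.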

A traffic-scenario is converted into a sequence of predicted occupancy grids
M = (G_{t_0}, ..., G_{t_0+τ_1}) for the prediction interval [t_0, t_0 + τ_1], with each
occupancy grid G_t representing the occupancies of road objects at time t. The
cells in the predicted occupancy grid G_t which lie outside of the road or are
occupied by another vehicle at time t are assigned a value of 1. The remaining
cells are assigned a value of 0, indicating that they are free. A scenario is
described by {M, η}, where η are the EGO vehicle's physical parameters like
velocity, yaw-rate, etc. Due to the 3D structure of the input M, a 3D-ConvNet is
used as the machine learning algorithm. In a simulation environment developed in
Matlab, many critical traffic-scenarios are simulated and the best trajectories π*
are found by the ARRT+ algorithm. The steering wheel angle profile and
longitudinal acceleration profile are extracted from the found trajectories and
their clusters are formed using hierarchical clustering based on the Euclidean
criterion. The combination of the clusters in which the acceleration profile and
steering wheel angle profile of the best trajectory for a scenario lie is
considered as the label for that scenario. The HARRT+ algorithm uses the mean
vectors of the predicted acceleration and steering wheel angle clusters to
generate a reference trajectory that is used for the biased-sampling to
increase the convergence speed of the algorithm. Basically, the HARRT+ algorithm
predicts a template trajectory π_t from a total of T template trajectories formed
by combining the mean vectors of all acceleration and steering wheel angle profile
clusters. The reference acceleration profile â_y and waypoints W are extracted
from this template trajectory for simultaneous biased sampling in the lateral
and longitudinal dynamics. Figure 1 explains the procedure for finding π* with
the HARRT+ algorithm.
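
The occupancy-grid encoding of a scenario described above can be illustrated
as follows. The sketch assumes a coarse rectangular grid indexed by
(row, column), and the callbacks `on_road` and `occupied_at` are hypothetical
stand-ins for the road geometry and the predicted positions of other road
objects.

```python
def occupancy_grid(rows, cols, on_road, occupied):
    """Grid with 1 for off-road or occupied cells and 0 for free cells."""
    return [[0 if on_road(r, c) and (r, c) not in occupied else 1
             for c in range(cols)]
            for r in range(rows)]

def predicted_grids(rows, cols, on_road, occupied_at, times):
    """Sequence M of grids, one per predicted time step (a 3D input)."""
    return [occupancy_grid(rows, cols, on_road, occupied_at(t)) for t in times]

# Toy scenario: a one-lane road in row 1 and one object moving along it.
M = predicted_grids(
    rows=3, cols=4,
    on_road=lambda r, c: r == 1,
    occupied_at=lambda t: {(1, t)},
    times=[0, 1, 2],
)
```

Stacking the grids along the time axis gives the three-dimensional structure
that motivates the use of a 3D-ConvNet.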


Fig. 1. HARRT+ algorithm


The HARRT+ algorithm uses a combination of biased and random sampling.
Therefore, it still has the property of probabilistic completeness even
with a wrong prediction of the template trajectory π_t. However, the computation
time for finding a safe trajectory is high when a wrong cluster is predicted
because of the wrong bias generation. Even if the right template trajectory is
predicted, the final safe trajectory may not always lie near it (as it can lie on
the boundary of clusters), or the final safe trajectory may have a very different
shape compared to π_t. In such situations as well, it is observed that the HARRT+
algorithm converges slowly.


4 Generation of Reference Trajectories 


From the explanation of the drawbacks of the HARRT+ algorithm, it is clear that
the computation time required for trajectory planning with the HARRT+ algorithm
strongly depends on the quality of the predicted reference trajectory, i.e., the
closer the predicted reference trajectory π_t is in distance and shape to π*, the
less computation time is required to find π*. This will not be possible in all
scenarios with a finite number of template trajectories. Therefore, a generative
model for trajectory generation using a VAE is proposed, with which many
reference trajectories can be generated. This trained VAE is further used in the
label generation and inference procedure of the other machine learning algorithm,
i.e., the 3D-ConvNet regressor, which maps the traffic-scenarios to the reference
trajectories. This section describes the training procedure for both machine
learning algorithms and their usage for predicting reference trajectories for
critical vehicle traffic-scenarios.


4.1 Generative Model for Trajectories π


In order to train the VAE for trajectories, 60000 different trajectories of duration τ_1
(= 2 s) are generated using the two-track vehicle dynamic model [20] with different
initial velocities and lateral and longitudinal dynamic interventions over the entire
trajectory, subject to the actuator and stable profile constraints mentioned in [18].
These trajectories are provided as input to the VAE in the form


$$\pi = \{ r_{x,t_0}, r_{x,t_0+\Delta t}, \ldots, r_{x,t_0+\tau_1}, r_{y,t_0}, r_{y,t_0+\Delta t}, \ldots, r_{y,t_0+\tau_1} \}, \tag{8}$$


where r_{x,t_i} and r_{y,t_i} are the coordinates of the center of gravity of the
vehicle at time t_i. The encoder Q_φ(z|π) maps trajectories to a latent space mean
vector z_μ|π and standard deviation vector z_σ|π, each of dimension 2. To avoid
clutter, they are simply written as z_μ and z_σ. As per the reparameterization
trick, the samples z are obtained by sampling ε from N(0, 1) and performing the
operation z_μ + ε z_σ. The decoder P_θ(π|z) reconstructs the trajectories using
samples generated from z_μ and z_σ as shown in Fig. 2. The root mean square
criterion is used for the reconstruction loss. Also, the trajectories are normalized
before the first layer in the encoder, and the final trajectories π are obtained by
denormalization and smoothing with a moving average filter. The activation
function used in each layer of the encoder and decoder is the hyperbolic tangent.
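
The sampling step and the final smoothing step can be sketched as below. Both
helpers are illustrative assumptions rather than the authors' code: the noise
is drawn independently per latent dimension, and the moving average uses a
centered window that is truncated at the ends of the trajectory, which is one
of several possible boundary conventions.

```python
import random

def reparameterize(z_mu, z_sigma, rng):
    """Sample z = z_mu + eps * z_sigma with eps ~ N(0, 1) per dimension."""
    return [m + rng.gauss(0.0, 1.0) * s for m, s in zip(z_mu, z_sigma)]

def moving_average(values, window=3):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    n = len(values)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed
```

Because the randomness enters only through ε, gradients can flow through z_μ
and z_σ, which is what makes the sampling step compatible with backpropagation.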


Fig. 2. VAE for trajectories


4.2 3D-ConvNet Regressor 


The task of the 3D-ConvNet here is to predict the value of the continuous variable
z_μ instead of predicting only finite class labels as in the HARRT+ algorithm.
Therefore, the 3D-ConvNet is used as a regressor with the input {M, η} and
corresponding target values z_μ. The architecture of the 3D-ConvNet is the same
as in the HARRT+ algorithm, except that the loss criterion is changed from
the cross-entropy to the root mean square error.




The label generation procedure for the 3D-ConvNet regressor is explained
in Fig. 3. For each traffic-scenario {M, η}, the best trajectory π* is found with
the ARRT+ algorithm in the Matlab simulation environment. This trajectory
is fed to the encoder Q_φ(z|π) of the trained VAE to find the corresponding z_μ,
which is assigned as the label for that scenario. In total, 44692 curved-road
critical traffic-scenarios with different radii of curvature and different numbers
and types of objects are used.


Fig. 3. Label generation using VAE


The inference procedure for the 3D-ConvNet is shown in Fig. 4. When a
traffic-scenario {M, η} is encountered, the trained 3D-ConvNet is used to predict
ẑ_μ, which is fed directly to the decoder network P_θ(π|z), bypassing the
reparameterization trick that the VAE uses to obtain a sample z, to get the
predicted reference trajectory π̂.


Fig. 4. Inference using VAE


5 Vehicle Motion Planning Algorithms 


5.1 Generative Algorithm for Trajectory Exploration (GATE) 


Because of the probabilistic nature of the VAE, the latent space generated by the
VAE is continuous, unlike in simple autoencoders where a deterministic mapping is
used. An optimization procedure can therefore be carried out to find the optimal
latent variable values z*, which generate the best trajectory π* using the decoder
P_θ(π|z), starting from a randomly initialized z. The cost function J can be defined
as per the application based on criteria such as safety, comfort, etc. As the goal
is to find trajectories for collision avoidance, the area occupied by the EGO
vehicle during the whole trajectory should not intersect with the non-free area,
i.e., the area occupied by other road participants and the area outside of the
road. Simultaneously, the criterion of keeping as large a distance as possible from
other road participants is added so that a small variation in the prediction of
other road participants does not lead to a collision. Therefore, the optimal z* is
found such that


$$z^* = \arg\min_{z}\,[J] = \arg\min_{z} \left[ \sum_{t} \left( S_{nf}(t) \cap S_{\pi|P_\theta(\pi|z)}(t) \right) - d_{\min}\big|_{P_\theta(\pi|z)} \right], \tag{9}$$


where S_nf(t) is the non-free area of the road at time t, i.e., the area outside of
the road and the area within the road occupied by other road participants at time
t, S_{π|P_θ(π|z)}(t) is the area occupied by the EGO vehicle at time t along the
trajectory π obtained by feeding z to the decoder P_θ(π|z), and d_min is the
shortest distance between S_{π|P_θ(π|z)}(t) and S_nf(t) over the whole trajectory π
in the time interval t ∈ [t_0, t_0 + τ_1]. The first term on the right-hand side of
Eq. 9 is the summation of the intersection of the non-free area of the road with
the EGO vehicle along the trajectory π. The goal is to make this term zero and to
increase d_min. The optimization solver used is a Matlab function for the
Nelder-Mead simplex method [24].
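
A toy evaluation of the cost in Eq. 9 is sketched below. It rests on
assumptions that are not part of the paper: areas are represented as sets of
occupied grid cells per time step, the intersection term is measured as the
number of shared cells, and d_min is the smallest Euclidean distance between
any EGO cell and any non-free cell at the same time step. The function name
`gate_cost` is hypothetical.

```python
import math

def gate_cost(ego_cells, non_free_cells):
    """Cost J: total overlap with the non-free area minus the minimum distance.

    ego_cells[t] and non_free_cells[t] are sets of (x, y) cells at time step t.
    """
    pairs = list(zip(ego_cells, non_free_cells))
    overlap = sum(len(ego & blocked) for ego, blocked in pairs)
    d_min = min((math.dist(p, q)
                 for ego, blocked in pairs
                 for p in ego
                 for q in blocked),
                default=0.0)  # no non-free cells at all: no distance reward
    return overlap - d_min
```

In a full pipeline, a decoder would map z to the EGO cells, and a
derivative-free solver such as Nelder-Mead (for example SciPy's
`scipy.optimize.minimize` with `method="Nelder-Mead"`) could then minimize this
cost over z.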

The final trajectory obtained by this procedure is highly dependent on the
initialization of the latent variable values. With a wrong initialization, the
optimization may get trapped in a local minimum, leading to suboptimal values
which could generate a trajectory with a severe collision. Therefore, the trained
3D-ConvNet is used to predict the initial values ẑ of the latent variables, which
should already be very close to z*. This whole procedure is shown in Fig. 5, and
the algorithm is named Generative Algorithm for Trajectory Exploration (GATE).


Fig. 5. GATE algorithm


5.2 GATE-ARRT+ 


Although GATE provides an opportunity to sample trajectories directly, it is
not a probabilistically complete algorithm like the RRT algorithm. This is because
the VAE only learns an approximation of the training data distribution and not
the true data distribution. Therefore, its capacity for generating trajectories
depends on the training data. However, the reference trajectory generated by GATE
can be used to bias the sampling of the ARRT+ algorithm to increase its
convergence rate. This combination is named the GATE-ARRT+ algorithm. As the
reference trajectories generated by the GATE algorithm are closer to the best
trajectories for a given traffic-scenario than the reference trajectories predicted
in the HARRT+ algorithm, the GATE-ARRT+ algorithm converges even more rapidly.
The procedure for finding the best trajectories with the GATE-ARRT+ algorithm
is shown in Fig. 6.


Fig. 6. GATE-ARRT+ algorithm


6 Results 


In order to validate the effectiveness of the proposed vehicle motion planning
algorithms, many different curved-road traffic-scenarios with different numbers
of objects having different initial velocities and positions are simulated in the
Matlab simulation environment, and safe trajectories are searched for with the
motion planning algorithms ARRT+, HARRT+, GATE and GATE-ARRT+. The
search is stopped when a collision-free trajectory is found or the maximum
number of samples has been used. The maximum number of samples N for the
ARRT+ algorithm is 2100, as it uses pure random sampling, while for HARRT+
and GATE-ARRT+ 300 samples are used. The number of iterations I of the
optimization procedure is limited to 10 and 2 for the GATE and GATE-ARRT+
algorithms, respectively. Note that this optimization procedure is optional in the
GATE-ARRT+ algorithm. The results are summarized in Table 1. They show that in
scenarios with few objects (1-2 objects), the GATE algorithm finds a
collision-free trajectory in the highest number of traffic-scenarios with the
shortest computation time because of the large amount of free space available. As
the number of objects increases, the free space available decreases, and therefore
the GATE algorithm converges in fewer traffic-scenarios. In such cases, the
GATE-ARRT+ algorithm proves to be more effective. The higher efficiency of the
GATE-ARRT+ algorithm compared to the HARRT+ algorithm is due to the better
reference trajectory provided by the VAE.

Figures 7, 8 and 9 show the safe trajectories planned with the GATE-ARRT+,
HARRT+ and GATE algorithms in a traffic-scenario where a collision
with a pedestrian crossing the street is predicted. From the figures it is clear
that the HARRT+ algorithm required more samples than the GATE-ARRT+
algorithm. Also, the final trajectory (longest black trajectory) found
by the GATE-ARRT+ algorithm has a smoother shape than the ones found by
the HARRT+ and GATE algorithms. This example shows that a better reference
trajectory indeed leads to a better final trajectory found by the ARRT+ algorithm.


Table 1. Comparison of Vehicle Motion Planning Algorithms

                                 ARRT+        HARRT+      GATE       GATE-ARRT+
                                 (N = 2100)   (N = 300)   (I = 10)   (N = 300, I = 2)
1-2 objects       Time (sec.)    3.33         0.81        0.31       0.45
(834 scenarios)   % Conv.        97.52        96.04       98.92      97.24
3-4 objects       Time (sec.)    3.62         1.08        0.83       0.68
(1728 scenarios)  % Conv.        92.99        91.14       72.22      93.17
5-6 objects       Time (sec.)    4.34         1.32        1.15       0.87
(3625 scenarios)  % Conv.        89.02        86          57.98      88.02


Fig. 7. Simulation result with GATE-ARRT+

Fig. 8. Simulation result with HARRT+ algorithm

Fig. 9. Simulation result with GATE algorithm


7 Conclusion


This paper presents a methodology for using a variational autoencoder to generate
many template trajectories which can be used for biased-sampling with a
sampling-based motion planning algorithm. Two different motion planning
algorithms, namely GATE and GATE-ARRT+, are proposed using the framework
of generating trajectories with a variational autoencoder. The simulation results
not only demonstrate an increase in the convergence speed compared to previously
proposed sampling-based motion planning algorithms but also show by example
an improvement in the quality of the final trajectory produced.


Abstract. In most cases, inverse problems are ill-posed or ill-conditioned,
which is the reason for the high sensitivity of their solution to noise in the input
data. Although neural networks are able to work with noisy data, in the case of
inverse problems this is not enough, because the ill-posedness of the problem
“outweighs” the ability of the neural network. In previous studies, the authors
have shown that separate use of the methods of group determination of parameters
and of noise addition during training of neural networks can improve the
resilience of the solution to noise in the input data. This study is devoted to the
investigation of the joint application of these methods. The study is performed on
the example of an inverse problem in laser Raman spectroscopy: determination of
the concentrations of ions in a solution of inorganic salts from the Raman spectrum
of the solution.


Keywords: Artificial neural networks · Perceptron ·
Multi-parameter inverse problems · Noise resilience ·
Group determination of parameters


1 Introduction 


Inverse problems (IPs) represent a very important class of problems. Almost any 
problem of indirect measurements belongs to this class. IPs include many problems 
from the areas of geophysics [1], spectroscopy [2], various types of tomography [3], 
and many others.

Practical IPs have a number of features that significantly complicate their solution.
As a rule, they are nonlinear and often have high input dimension and high output 
dimension (they are multi-parameter problems). In general, the IPs have no analytical 
solution, so in most cases they are solved numerically. 

Traditional methods for solving IPs are matrix methods using Tikhonov regular- 
ization [4], as well as optimization methods based on multiple solutions of the direct 
problem and minimization of the discrepancy in the space of the observed values [5]. 

However, traditional methods have a number of disadvantages. For methods based 
on regularization, the main difficulty is the choice of the regularization parameter. In 
addition, matrix methods are linear methods, so in order to use them to solve nonlinear 
problems, it is necessary to perform nonlinear data preprocessing. 

Optimization methods are characterized by high computational cost and require a
good first approximation (in some cases obtained by alternative measurement
methods). The main disadvantage of optimization methods is the need for a
correct model of the direct problem; without such a model, the method is not
applicable. In addition, because IPs are ill-posed, a small discrepancy in the
space of the observed values does not guarantee a small error in the space of
the determined parameters [6].

Therefore, in this paper we consider artificial neural networks as an alternative that 
is free from the shortcomings inherent in traditional methods of solving IPs. 

In most cases, IPs are ill-posed or ill-conditioned, which is the reason for the high 
sensitivity of their solutions to noise in the input data, both for traditional methods and 
for neural networks. At the same time, the IP solutions will almost always deal with 
noisy data, because any measurements are characterized by some measurement error. 
As a result, the development of some approaches to improve the resilience of the IP 
solution to noise in the input data is an urgent task. 

Although neural networks are able to work with noisy data, in the case of IPs
this is not enough, because the ill-posedness of the problem often "outweighs"
the ability of the neural network.

This study, as well as a number of previous works of the authors [7-10], is devoted 
to the development of approaches to improve the resilience of neural network solutions 
of multi-parameter inverse problems to noise in the input data. 

In [7, 8] it has been demonstrated that simultaneous determination of a group of 
parameters in some cases allows increasing the resilience of the neural network solution 
to noise in the input data. In this case, as a rule, the higher is the noise level, the more 
pronounced is the effect of using this approach. 

Adding noise during the training of perceptron-type neural networks has proved
to be a useful method of improving the trained network in various respects. The
basis for this method was laid in [9, 10], where it was demonstrated that it
can improve the generalizing capabilities of the network. In [11] it was shown
that use of this method is equivalent to Tikhonov regularization. In addition,
it can be used to prevent network overtraining [12-14], as well as to speed up
learning [15]. The method is also used in the training of deep neural networks
[16]. In [17, 18], the authors added noise during training to increase the
resilience of trained perceptron-type neural networks to noise in the input
data, and the method proved effective.
In this paper, the efficiency of the two named methods was compared on the data
of a high-dimensional multi-parameter non-linear inverse problem in laser Raman
spectroscopy, and their combined use was investigated.


2 Problem Statement 


The problem considered in this paper was to determine the concentrations of 10 ions
(Cl⁻, F⁻, HCO₃⁻, K⁺, Li⁺, Mg²⁺, Na⁺, NH₄⁺, NO₃⁻, SO₄²⁻) contained in
multi-component solutions of 10 inorganic salts (MgSO₄, Mg(NO₃)₂, LiCl, LiNO₃,
NH₄F, (NH₄)₂SO₄, KF, KHCO₃, NaHCO₃, NaCl) from their Raman scattering spectra
(Fig. 1). The investigated solutions contained from 1 to all 10 of the salts in
the concentration range 0-1.5 M (mol/L) with an increment of 0.15-0.25 M. The
spectra were excited with an argon laser at a wavelength of 488 nm. The spectra
were registered by a multi-channel detector based on a CCD matrix. For each
solution, the spectrum was registered in 1824 channels in the range of Raman
frequencies 565-4000 cm⁻¹. The initial data set on which this study was
performed contained 4445 patterns.
Fig. 1. Sample Raman spectra of multi-component solutions.


In principle, Raman spectra can be used to determine ion concentrations in a
solution because the spectrum is highly sensitive to the type and concentration
of the substances dissolved in water. Many complex ions (sulfides, sulfates,
nitrates, phosphates etc.) have their own Raman bands in the region of
300-2000 cm⁻¹ (Fig. 1, left) [19, 20]. The position of these lines strictly
corresponds to the frequency of oscillations of the molecular groups of these
ions, and the intensity of the lines depends on their concentration in water.
For a solution of several salts, the dependence of the line intensity on the
concentration is non-linear. Monoatomic (simple) ions (e.g. Na⁺, Cl⁻, K⁺ etc.)
have no Raman lines of their own; however, they affect the Raman valence band
of the water itself (Fig. 1, right) [21-24]. At present, no adequate
mathematical models describing such interactions are available; therefore,
practically the only way to solve the problem under consideration is to use
machine learning methods based on experimental data.
3 Description of the Noise


For the considered problem, experimental data may contain the following types of data 
distortions: 


a. Inaccuracies in salt concentration values; 

b. Random noise in determining the intensity of the spectrum in various channels; 

c. Distortions arising from factors influencing the entire spectrum: excessive illumi- 
nation of the sample, change in laser power, channel shifting of the spectrum due to 
small changes of alignment etc. 


In this paper, we consider case b — Random noise in determining the intensity of the 
spectrum in various channels. 

Two types of noise were considered: additive and multiplicative, and two kinds of 
statistics: uniform noise (uniform noise distribution) and Gaussian noise (normal dis- 
tribution). The value of each observed feature was transformed as follows: 


$$x_i^{agn} = x_i + \mathrm{norminv}(\mathrm{random},\ \mu = 0,\ \sigma = \text{noise level}) \cdot \max(x_i)$$

$$x_i^{aun} = x_i + (1 - 2 \cdot \mathrm{random}) \cdot \text{noise level} \cdot \max(x_i)$$

$$x_i^{mgn} = x_i \cdot \bigl(1 + \mathrm{norminv}(\mathrm{random},\ \mu = 0,\ \sigma = \text{noise level})\bigr)$$

$$x_i^{mun} = x_i \cdot \bigl(1 + (1 - 2 \cdot \mathrm{random}) \cdot \text{noise level}\bigr)$$

for additive Gaussian (agn), additive uniform (aun), multiplicative Gaussian
(mgn), and multiplicative uniform (mun) noise, respectively. Here random is a
random value in the range from 0 to 1, the norminv function returns the inverse
of the normal cumulative distribution, max(x_i) is the maximum value of the
given feature over all patterns, and noise level is the level of noise (the
considered values were 1%, 3%, 5%, 10%, and 20%).

When working with noisy data, each pattern of the initial training and test sets
was presented in 10 noise realizations. Each set contained noise of a certain
level, type and statistics. Including the set without noise, there were 21
out-of-sample (examination) data sets in total: 5 noise levels × 2 noise types
× 2 kinds of statistics + 1 = 21.
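
The four noise transformations and the count of examination sets described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the names `add_noise`, `LEVELS`, `KINDS` and `EXAM_SETS` are hypothetical, the noise level is expressed as a fraction (0.05 for 5%), and `statistics.NormalDist.inv_cdf` from the standard library plays the role of norminv.

```python
import random
from itertools import product
from statistics import NormalDist


def _open_unit(rng):
    # inv_cdf needs p strictly inside (0, 1); random() may return exactly 0.0
    p = rng.random()
    while p == 0.0:
        p = rng.random()
    return p


def add_noise(x, x_max, level, kind, rng):
    """Distort one feature value x; x_max is the feature's maximum over all patterns."""
    if kind == "agn":  # additive Gaussian
        return x + NormalDist(0.0, level).inv_cdf(_open_unit(rng)) * x_max
    if kind == "aun":  # additive uniform
        return x + (1 - 2 * rng.random()) * level * x_max
    if kind == "mgn":  # multiplicative Gaussian
        return x * (1 + NormalDist(0.0, level).inv_cdf(_open_unit(rng)))
    if kind == "mun":  # multiplicative uniform
        return x * (1 + (1 - 2 * rng.random()) * level)
    raise ValueError(f"unknown noise kind: {kind}")


LEVELS = (0.01, 0.03, 0.05, 0.10, 0.20)
KINDS = ("agn", "aun", "mgn", "mun")  # 2 noise types x 2 kinds of statistics

# One examination set per (kind, level) pair, plus the noise-free set.
EXAM_SETS = [("none", 0.0)] + [(k, lv) for lv, k in product(LEVELS, KINDS)]
```

Note that the additive variants scale the perturbation by the feature maximum, whereas the multiplicative variants scale it by the value itself, so weak channels receive proportionally less multiplicative noise.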


4 Solving the Problem 


4.1 Use of Neural Networks 


To solve the problem, we used one of the most widespread neural network
architectures, the multilayer perceptron (MLP). The networks contained three
hidden layers with 64, 32 and 16 neurons in the 1st, 2nd and 3rd hidden layers,
respectively. The activation function was logistic in the hidden layers and
linear in the output layer. Training was carried out by stochastic gradient
descent. Each network was trained 5 times with different weight
initializations, and the statistics of these 5 networks were averaged.
To prevent overtraining of the neural networks, early stopping of the training
was used. The initial data set was randomly divided into training, validation
and test sets. Training was performed on the training data set and was stopped
at the minimum of the mean squared error on the validation set (after 1000
epochs without improvement of the result). Independent evaluation of the
results was performed on the test (out-of-sample) set along with the additional
test sets with noise described in Sect. 3.
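
The stopping rule described above (stop once the validation error has not improved for 1000 epochs and keep the network from the best epoch) can be illustrated with the following sketch. The function name `early_stopping_epoch` and the representation of training as a list of per-epoch validation MSE values are assumptions made for this example only.

```python
def early_stopping_epoch(val_mse, patience=1000):
    """Return (best_epoch, stop_epoch) for a non-empty list of validation MSE values.

    Training stops at the first epoch that lies `patience` epochs past the
    last improvement; the weights of `best_epoch` would be the ones kept.
    """
    best_epoch, best = 0, float("inf")
    for epoch, mse in enumerate(val_mse):
        if mse < best:
            best_epoch, best = epoch, mse
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # The error was still improving (or patience never ran out) at the end.
    return best_epoch, len(val_mse) - 1
```

With a small patience the behaviour is easy to follow by hand: for the sequence 5, 4, 3, 3.5, 3.2, 3.1 and `patience=3`, the minimum is reached at epoch 2 and training stops at epoch 5.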

In the case of data without noise, the training, validation and test sets
contained 70, 20 and 10% of the patterns of the original set, respectively.
Thus, the training set contained 3112 patterns, the validation set 889
patterns, and the test set 444 patterns.

In the case of data containing noise, each pattern of these sets was presented
in 10 noise realizations. Thus, the training set contained 31120 patterns and
each test set contained 4440 patterns. The validation set was left unchanged
(see Sect. 4.4).
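
The set sizes quoted above follow from the 70/20/10 proportions applied to 4445 patterns. The sketch below reproduces them; the helper `split_indices`, the fixed seed and the round-half-up rule are assumptions of this example, since the text does not say how fractional sizes were rounded.

```python
import random


def split_indices(n_patterns, seed=0):
    """Shuffle pattern indices and split them 70/20/10 into train/validation/test."""
    idx = list(range(n_patterns))
    random.Random(seed).shuffle(idx)
    # Integer arithmetic avoids floating-point surprises when rounding x.5 values.
    n_train = (7 * n_patterns + 5) // 10
    n_val = (2 * n_patterns + 5) // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train, val, test = split_indices(4445)
print(len(train), len(val), len(test))   # 3112 889 444
print(10 * len(train), 10 * len(test))   # 31120 4440 (10 noise realizations each)
```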

When using training with noise, neural networks trained with noise were applied to 
test sets with the same noise type and noise statistics. In the case of noise-free training, 
neural networks were applied to all test sets. 


4.2 Selection of Input Features 


To reduce the input dimension of the problem, a priori knowledge about the
object was used. The input of the neural network was fed with the features
representing the spectrum intensities in the channels lying in the intervals
960-1143, 1312-1690, and 3014-3601 cm⁻¹, which correspond to the most
informative parts of the spectrum: the valence band of water and the
characteristic lines of complex ions. Thus, the input dimension of the problem
was reduced almost three-fold and amounted to 664 features.
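
Selecting the informative spectral bands amounts to a simple mask over the channel axis. The sketch below is illustrative only: `select_channels` and `reduce_spectrum` are hypothetical helpers, and the mapping from channel number to Raman frequency (the `wavenumbers` argument) is not given in the text, so the figure of 664 features cannot be reproduced without the real calibration.

```python
# Band limits in cm^-1, taken from the text.
BANDS = ((960.0, 1143.0), (1312.0, 1690.0), (3014.0, 3601.0))


def select_channels(wavenumbers, bands=BANDS):
    """Return the indices of channels whose wavenumber lies inside any band."""
    return [i for i, w in enumerate(wavenumbers)
            if any(lo <= w <= hi for lo, hi in bands)]


def reduce_spectrum(spectrum, keep):
    """Keep only the selected channels of one spectrum."""
    return [spectrum[i] for i in keep]
```

For a real data set, `select_channels` would be called once on the detector's wavenumber axis, and the resulting index list would then be applied to every spectrum.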


4.3 Method of Group Determination of Parameters


The following ways of parameter determination were considered:


• Autonomous determination: for each ion, a separate single-output MLP was
trained.

• Simultaneous determination: a single neural network with 10 outputs was used.

• Group determination: the parameters were grouped according to the following
principles: simple ions, complex ions, cations, anions. Thus, from 4 to 6
parameters were determined simultaneously using a neural network with the
corresponding number of outputs.


Each of the listed methods of parameter determination was applied both
independently and in conjunction with the method of training with noise.
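
The grouping principles listed above can be made concrete as follows. The membership of each group is derived here from the ion list in Sect. 2, treating polyatomic ions as "complex" and monoatomic ions as "simple"; this derivation, the ASCII ion labels and the helper `output_columns` are assumptions of the sketch rather than details stated by the authors.

```python
IONS = ["Cl-", "F-", "HCO3-", "K+", "Li+", "Mg2+", "Na+", "NH4+", "NO3-", "SO4^2-"]
COMPLEX = {"HCO3-", "NH4+", "NO3-", "SO4^2-"}  # polyatomic ions

GROUPS = {
    "simple":  [ion for ion in IONS if ion not in COMPLEX],
    "complex": [ion for ion in IONS if ion in COMPLEX],
    "cations": [ion for ion in IONS if ion.endswith("+")],
    "anions":  [ion for ion in IONS if ion.endswith("-")],
}


def output_columns(group):
    """Columns of the 10-column target matrix used as outputs for one group network."""
    return [IONS.index(ion) for ion in GROUPS[group]]


for name, members in GROUPS.items():
    print(name, len(members))  # group sizes range from 4 to 6
```

Under this reading, each ion belongs to exactly two groups (one by composition, one by charge), so every concentration can be estimated by an autonomous network, by two group networks, and by the 10-output network.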


4.4 Method of Training with Noise 


This method was implemented by using training data sets containing a certain
level of noise. In this case it is possible to abandon the validation data set,
because the addition of noise is in itself a method of preventing overtraining
[12-14]. If a validation set is used, it can contain noise of the same type,
statistics and level as the training set; alternatively, it is possible to use
a validation set that does not contain noise. In [17] it was shown that the
optimal approach was to train on a training set that contained noise and to
stop the training on a validation set without noise. With this approach, the
quality of the solution was higher and the training time was lower.

This approach was the one used in the present study.


5 Results 


Figure 2 shows the dependence of the solution quality (mean absolute error, MAE) for 
the original solution, separately for the method of training with noise and for the 
method of group determination of parameters, on the noise level in the test set. It can be 
seen that the resilience of the solution to noise in the data is higher for multiplicative 
noise than for additive noise and higher for uniform noise than for Gaussian noise. 
Fig. 2. The dependence of the quality of the solution (MAE) for the Cl⁻ ion on the noise level in
the test set for different noise types and noise statistics. Red lines represent the original solution
(no noise and no grouping), other line colors represent the method of adding noise to the training
patterns; markers show the results of group determination. (Color figure online)


For the method of adding noise during training, it can be seen that the higher
the noise level in the training data set, the worse the network performs on
data without noise, but the slower it degrades with increasing noise level. For
the other ions under consideration, the dependencies are of the same nature.
Fig. 3. The dependence of the solution quality (MAE) on the noise level in the test set for the K⁺
ion for additive Gaussian noise. Lines: only the method of adding noise during training; markers:
joint use of both methods. The individual graphs correspond to different noise levels added to the
training set.
Fig. 4. The dependence of the solution quality (MAE) on the noise level in the training set for
the NH₄⁺ ion for additive uniform noise. Lines: only the method of adding noise during training;
markers: joint use of both methods. The individual graphs correspond to different noise levels
contained in the test sets.
It can also be seen that group determination of parameters in some cases
increases the resilience of the solution. The higher the noise level in the
test set, the stronger the effect of using this method. However, on its own the
method performs significantly worse than the method of training with noise.

Figure 3 shows that the combined use of group determination and training with 
noise can improve the resilience of the solution relative to use of only the method of 
training with noise. As in the case of group determination, this approach has a more 
pronounced effect at high noise levels in the test set. It can also be noted that the lower 
the noise level in the training set, the more noticeable is the effect. 

Figure 4 shows that the highest quality of the solution for the method of
training with noise only is observed when the noise levels in the training and
test sets are the same. It can also be seen that the effect of the joint
application of the methods is mostly observed at high test set noise levels.


6 Conclusions 


Thus, in this paper the efficiency of the methods used was confirmed, and the following 
conclusions were obtained: 


• For the inverse problem of Raman spectroscopy, for all the approaches used,
the resilience of the solution to noise in the input data is higher for the
multiplicative noise type than for the additive noise type, and higher for the
uniform noise distribution than for the Gaussian noise distribution.

• When using the method of adding noise during MLP training, the higher the
noise level in the training data set, the worse the network performs on the
data without noise, but the slower it degrades with increasing noise level in
the test set.

• Joint use of the methods of group determination and of training with noise
improves the resilience of the solution compared to use of only the method of
training with noise.

• When using group determination, both in the case of joint use with the method
of training with noise and in the case of individual use, the effect is more
pronounced at high noise levels in the test set.

• The lower the noise level in the training set, the more noticeable the effect.

• The highest quality of the solution for the method of training with noise is
observed when the noise level in the training and test sets is the same.


Since the methods of group determination of parameters and of training with noise 
were successfully used also for solving other inverse problems, it can be concluded that 
the observed effects are a property of the perceptron as a data processing algorithm, and 
they are not determined by the properties of the data. 
Abstract. Generative question answering systems aim to generate more
contentful responses and more natural answers. Existing generative
question answering systems applied to knowledge-grounded conversation
generate natural answers either with a knowledge base (KB) or with
raw text. Nevertheless, their performance is often affected by the
incompleteness of the KB or of the text facts. In this paper, we propose
an end-to-end generative question answering model. We make use of
unstructured text and structured KBs to establish a universal schema
as a large library of external facts. Each word of a natural answer is
dynamically predicted from the common vocabulary or retrieved from
the corresponding external facts. Our model can generate natural
answers containing an arbitrary number of knowledge entities by
selecting from multiple relevant external facts with the dynamic
knowledge enquirer. Finally, an empirical study shows that our model is
efficient and significantly outperforms baseline methods in terms of
both automatic evaluation and human evaluation.


Keywords: Natural answers · Universal schema ·
Sequence-to-sequence learning


1 Introduction 


Recent neural models of dialogue generation, such as the sequence-to-sequence
model, can be trained in an end-to-end and completely data-driven fashion.
However, these fully data-driven models tend to generate safe responses that
are boring and carry little information. In other words, these models have no
access to external knowledge, which makes it difficult for them to respond
substantively. From another perspective, we can consider generative question
answering as a special case of knowledge-grounded conversation. As the examples
in Table 1 show, daily conversations generally depend on an individual's
knowledge. Recently, some researchers have proposed neural conversation models
that can generate natural answers and knowledge-grounded responses either with
a knowledge base or with raw text.
Table 1. Examples of training instances for our model. The natural answer, which
contains multiple knowledge entities, is generated based on both knowledge bases
and text.


KB fact:    (Peking University, President, Yan Fu)
Text fact:  Yan Fu is the first President of Peking University.
User A:     Who was the first President of Peking University?
User B:     The first President is Yan Fu

KB fact:    (The Journey to the West, author, Wu Chenen)
            (The Journey to the West, written time, Ming dynasty)
Text fact:  Wu Chenen who is the author of The Journey to the West
            was an outstanding novelist of the Ming dynasty.
User A:     Who is the author of "journey to the west".
User B:     It was wu chengen in the Ming dynasty.


In order to generate more contentful responses, more and more generative
question answering systems and knowledge-grounded conversation models have
been proposed. On the one hand, Ghazvininejad et al. [4] utilized external
textual information as unstructured knowledge. They found that unstructured
knowledge can make a response more contentful. On the other hand, Yin et al.
[18], He et al. [6] and Zhu et al. [20] have proposed generative question
answering (QA) models that can generate natural answers from entities retrieved
from the KB and a seq2seq model [15]. However, the performance of the above
models is often affected by the incompleteness of the KB or of the text. How to
generate more contentful responses or natural answers by exploiting KB and text
together therefore needs to be studied.

In this paper, we propose a neural generative dialogue model that can generate
responses based on the input message and external facts. To our knowledge for
the first time, we combine text and a KB into a library of external facts by
building a universal schema [10] in order to generate natural answers. At each
time step of generating the natural answer, the next word may come from the
common word vocabulary or from the knowledge entity vocabulary, so that a
natural answer containing an arbitrary number of relevant entities can be
generated. Finally, we conduct experiments on real-world datasets. Experimental
results demonstrate that combining unstructured knowledge with structured
knowledge is effective for generating natural answers, and that our model is
more efficient than existing end-to-end QA/dialogue models.


2 Related Work 


Recently, sequence-to-sequence learning [7, 15], which can predict a target
sequence given a source sequence, has been widely applied in dialogue systems.
Shang et al. [14] first utilized the encoder-decoder framework to generate
responses on micro-blogging websites. After that, more and more dialogue
systems [12, 13, 16] based on the seq2seq framework were proposed. Our model
is also based on the seq2seq framework, and we try to combine external facts
composed of KB and text to generate more contentful responses.

Many researchers have proposed open-domain dialogue systems that incorporate
external knowledge to enhance reply generation. Han et al. [5] proposed a
rule-based dialogue system that fills response templates with retrieved KB
entries. Ghazvininejad et al. [4] utilized external textual information as
unstructured knowledge and demonstrated that it can convey more relevant
information to responses. Some recent work used an external structured
knowledge graph to build end-to-end question answering systems. Yin et al. [18]
proposed a seq2seq-based model in which answers were generated in two ways: one
based on a language model and the other using entities retrieved from the KB.
He et al. [6] and Zhu et al. [20] further studied the cases where questions
require multiple facts and out-of-vocabulary entities.

In order to improve the performance of knowledge base QA models, Das et al.
[3] extended the universal schema to natural language question answering,
employing memory networks to attend to the large number of facts in the
combination of text and KB. Inspired by them, we also build a universal schema
to combine KB and text and employ a key-value MemNN model as our knowledge
enquirer. Unlike their model, however, our model can generate a natural answer
rather than a single entity. Other works, such as [17, 19], also put forward
models that exploit KB and text together, but their formulations are totally
different from ours.


3 Our Framework 


3.1 Framework Overview 


In real-world environments, people prefer to answer a question in a natural
way. As in the example shown in Table 1, when user A asks "Who is the first
President of Peking University?", user B should answer "The first President is
Yan Fu" rather than reply with only one entity or with an answer that is not
relevant to the question. For this natural language question-answering
scenario, the problem can be defined as follows: given an input message
$Q = (x_1, x_2, \ldots, x_L)$, generate an appropriate response
$Y = (y_1, y_2, \ldots, y_{L_Y})$ based on all possible facts from text and KB.
To address this problem, we propose an end-to-end generative question answering
system, which is illustrated in Fig. 1.


3.2 Candidate Facts Retriever 


The candidate facts retriever identifies facts that are related to the input
message. In our work, the model retrieves the relevant text facts by first
finding the relevant KB triples (subject-property-object) in the universal
schema. Specifically, we denote the entities of $Q$ by
$E = (e_1, e_2, \ldots, e_m)$. $E$ can be identified by keyword matching, or
detected by more advanced methods such as entity linking or named entity
recognition. Based on the detected triples, we can retrieve the relevant facts
from the universal schema. Usually, the question contains the information used
to match the subject and property parts of a fact triple, and the answer
incorporates the object part.

Fig. 1. The overview of our model. Our model consists of a message encoder, a
candidate facts retriever, a reply decoder, and a universal schema containing
external facts. When the user inputs a question, the knowledge retrieval module
is first employed to retrieve related facts. Then the message encoder encodes
the question into hidden states. Finally, the hidden states from the message
encoder are fed to the reply decoder to generate the natural answer.


3.3 Question Encoder 


In order to capture the user's intent and obtain hidden representations of the
input message, we employ a bidirectional GRU [2, 11], i.e. two independent
GRUs, to transform the message $Q = (x_1, x_2, \ldots, x_L)$ into a sequence of
concatenated hidden states. Once a message is encoded by the message encoder,
the forward and backward GRUs respectively produce the hidden states
$\{\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_L\}$
and $\{\overleftarrow{h}_L, \ldots, \overleftarrow{h}_2, \overleftarrow{h}_1\}$,
where $L$ is the maximum length of the message. The context memory of the input
message is the list of concatenated hidden states
$H_Q = \{h_1, \ldots, h_t, \ldots, h_L\}$, where
$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_{L-t+1}]$. In addition, the last
hidden state $h_L$ is used to represent the entire message.
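
The index bookkeeping behind the concatenated encoder states is easy to get wrong, so the sketch below shows only the alignment of forward and backward states, not a GRU. The recurrent cell is replaced by a toy function that records which tokens it has read; `run_recurrent`, `encode` and `toy_cell` are hypothetical names, and a real encoder would plug in a GRU cell with vector states.

```python
def run_recurrent(tokens, cell, state):
    """Apply a recurrent cell over tokens and return the state after each step."""
    states = []
    for tok in tokens:
        state = cell(state, tok)
        states.append(state)
    return states


def encode(tokens, cell, init):
    """Pair each forward state with the backward state that ends at the same token."""
    L = len(tokens)
    fwd = run_recurrent(tokens, cell, init)        # fwd[i] has read x_1 .. x_{i+1}
    bwd = run_recurrent(tokens[::-1], cell, init)  # bwd[k] has read x_L .. x_{L-k}
    # 0-based version of pairing forward state t with backward state L - t + 1
    return [(fwd[i], bwd[L - 1 - i]) for i in range(L)]


def toy_cell(state, tok):
    # Stand-in for a GRU cell: the "state" is the tuple of tokens read so far.
    return state + (tok,)


H = encode(["who", "wrote", "it"], toy_cell, ())
```

In this toy setting each pair covers the whole message between its two halves, with both halves ending at the token for that position, which is the property the concatenation is meant to provide.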
3.4 Reply Decoder


The reply decoder generates the final response $Y$ based on the hidden
representations of the input message $H_Q$ and the candidate facts $F_Q$ that
come from the universal schema. The generated response contains two categories
of words: common words and knowledge words. Specifically, the probability of
generating the answer is

$$p(y_1, y_2, \ldots, y_{L_Y} \mid H_Q, F_Q; \theta) = p(y_1 \mid H_Q, F_Q; \theta) \prod_{t=2}^{L_Y} p(y_t \mid y_1, y_2, \ldots, y_{t-1}, H_Q, F_Q; \theta) \qquad (1)$$

where $\theta$ represents the parameters of the model. The generation
probability of $y_t$ is specified by

$$p(y_t \mid y_1, y_2, \ldots, y_{t-1}, H_Q, F_Q; \theta) = p(y_t \mid y_{t-1}, z_t, s_t, H_Q, F_Q; \theta) \qquad (2)$$

where $s_t$ is the hidden state of the decoder and $z_t \in \{0, 1\}$ is the
value predicted by a binary classifier. When generating the $t$-th word $y_t$
of the answer, the probability is given by the following mixture model:

$$p(y_t \mid y_1, y_2, \ldots, y_{t-1}, H_Q, F_Q; \theta) = p_e(y_t \mid z_t = 0)\, p(z_t = 0 \mid y_{t-1}, s_t, H_Q, F_Q; \theta) + p_c(y_t \mid z_t = 1)\, p(z_t = 1 \mid y_{t-1}, s_t, H_Q, F_Q; \theta) \qquad (3)$$


Response Words Prediction Classifier. In order to generate a final response
containing both common words and knowledge words, we apply an MLP as a binary
classifier. At each time step, given $s_{t-1}$ and $y_{t-1}$, the MLP
classifier outputs a predicted value $z_t \in \{0, 1\}$. If $z_t = 0$, the next
word is generated from the entity vocabulary, which in our work contains all
the "object" entities of the KB triples. Conversely, if $z_t = 1$, the next
word is generated from the common vocabulary. In summary, $y_t$ is generated
as:

$$p(y_t \mid y_{t-1}, z_t, s_t, H_Q, F_Q; \theta) = p_e(y_t)\, p(z_t = 0 \mid y_{t-1}, s_t, H_Q, F_Q; \theta) + p_c(y_t)\, p(z_t = 1 \mid y_{t-1}, s_t, H_Q, F_Q; \theta) \qquad (4)$$
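
The mixture in Eq. (4) combines a distribution over entity words with a distribution over common words, weighted by the classifier output. The sketch below illustrates this with plain dictionaries; the function `mix` and the example probabilities are invented for illustration, and the two component distributions are assumed to be already normalized.

```python
def mix(p_entity, p_common, p_z0):
    """Mixture over the union of both vocabularies.

    p_entity: probabilities of entity words (used when z = 0)
    p_common: probabilities of common words (used when z = 1)
    p_z0:     classifier probability that the next word is an entity word
    """
    words = set(p_entity) | set(p_common)
    return {w: p_entity.get(w, 0.0) * p_z0 + p_common.get(w, 0.0) * (1.0 - p_z0)
            for w in words}


p = mix({"Yan Fu": 0.9, "Wu Chenen": 0.1},
        {"is": 0.6, "the": 0.4},
        p_z0=0.75)
next_word = max(p, key=p.get)
```

Because both components sum to one and the classifier weights sum to one, the mixed distribution is again normalized, so the most probable next word can be chosen across both vocabularies at once.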


Universal Schema. To make full use of external facts from structured KBs and
unstructured text, our external knowledge $M$ comprises both KB and text.
Inspired by Das et al. [3], we apply the universal schema to integrate KB and
text. Each cell of the universal schema is a key-value pair. Specifically, let
$(s, r, o) \in K$ represent a KB triple; the key $k$ is represented by
concatenating the embeddings of $s$ and $r$, and the object entity $o$ is
treated as its value $v$. For text, let
$(s, [w_1, \ldots, entity_1, \ldots, entity_2, w_n], o) \in T$ represent a
textual fact, where $entity_1$ and $entity_2$ correspond to the positions of
the subject and object entities. We represent the key as the sequence formed by
replacing $entity_1$ with the subject and $entity_2$ with a special 'blank'
token, i.e., $k = [w_1, \ldots, s, \ldots, blank, w_n]$, which is converted to
a distributed representation using a bidirectional GRU, and the value is just
the object entity $o$.
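
The construction of universal schema cells can be illustrated symbolically, before any embedding is applied. In the sketch below, `kb_cell`, `text_cell` and the `BLANK` token string are hypothetical names, multi-word entities are treated as single tokens, and the keys are left as symbols; in the model described above they would be turned into vectors (concatenated embeddings for KB keys, a bidirectional GRU encoding for text keys).

```python
BLANK = "_blank_"  # stand-in for the special 'blank' token


def kb_cell(triple):
    """KB triple (s, r, o): the key is the (subject, relation) pair, the value is the object."""
    s, r, o = triple
    return {"key": (s, r), "value": o}


def text_cell(tokens, subject_pos, object_pos, subject, obj):
    """Textual fact: put the subject at entity_1, a blank token at entity_2."""
    key = list(tokens)
    key[subject_pos] = subject
    key[object_pos] = BLANK
    return {"key": key, "value": obj}


kb = kb_cell(("Peking University", "President", "Yan Fu"))
txt = text_cell(
    ["Yan Fu", "is", "the", "first", "President", "of", "Peking University"],
    subject_pos=6, object_pos=0,
    subject="Peking University", obj="Yan Fu",
)
```

Both cells share the same value, so a query that matches either key retrieves the same answer entity, which is what allows text facts to compensate for missing KB triples and vice versa.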
Knowledge Enquirer. The knowledge enquirer calculates the matching scores
between the question and the candidate facts. We have chosen two
implementations that have a similar effect in our experiments. The first model
is a two-layer MLP. The fact representation $f$ is defined as the concatenation
of key and value. The list of all related facts' representations,
$\{f\} = \{f_1, f_2, \ldots, f_{L_F}\}$ ($L_F$ denotes the maximum number of
candidate facts), is considered to be a short-term memory of the large external
knowledge memory $M$. We define the matching score function between question
and facts as $S(q, s_t, f_j) = \mathrm{DNN}_1(q, s_t, f_j)$, where $s_t$ is the
hidden state of the decoder at time $t$ and $\mathrm{DNN}_1$ is the two-layer
MLP. In addition, we also adopt the key-value MemNN proposed by Miller et al.
[8], where each memory slot consists of a key and a value. It is worth noting
that, besides the question and the related facts, we also need to use the state
$s_t$ of the decoding process as an input of the key-value MemNN, because the
matching results also depend on the state of the decoding process at different
times.
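
A single read from a key-value memory can be sketched as follows: keys are scored against a query, the scores are normalized with a softmax, and the values are averaged with the resulting weights. This is a generic, parameter-free illustration rather than the authors' enquirer: `kv_read` is a hypothetical helper, the score is a plain dot product with no learned matrices, and the query is assumed to already combine the question representation with the decoder state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kv_read(query, keys, values):
    """Address the memory by key, then read a weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, read


weights, read = kv_read(query=[1.0, 0.0],
                        keys=[[5.0, 0.0], [0.0, 5.0]],
                        values=[[1.0, 0.0], [0.0, 1.0]])
```

The weights play the role of matching scores over the candidate facts; because the query changes with the decoder state, the same set of facts can be weighted differently at each decoding step.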


Common Word Generator. To generate richer content and answers that better match
user questions, we apply a GRU model with an attention mechanism to generate
common words. First, we calculate a message context vector $c_t$ by applying
the attention mechanism [1] to the message hidden vectors $H_Q$ with the
current generator hidden state $s_{t-1}$. Then, the hidden state of the next
time step is obtained as $s_t = f(y_{t-1}, s_{t-1}, c_t)$. Finally, the target
word $y_t$ at time $t$ is predicted by a softmax classifier over a fixed
vocabulary (e.g. 40,000 words) through the function $g$:
$p(y_t \mid y_{<t}, X) = g(y_{t-1}, s_t, c_t)$.


State Update. In the generic decoding process, each hidden state $s_t$ is
updated with the previous state $s_{t-1}$, the word embedding of the previously
predicted symbol $y_{t-1}$, and an optional context vector $c_t$ (with the
attention mechanism). However, $y_{t-1}$ may not come from the entity
vocabulary and may not have a word vector. Therefore, we modify the state
update process. More specifically, $y_{t-1}$ is represented as
$[e(y_{t-1}), c_{k_{t-1}}]$, where $e(y_{t-1})$ is the word embedding
associated with $y_{t-1}$ and $c_{k_{t-1}}$ is the weighted sum of the hidden
states in $M_F$ corresponding to $y_{t-1}$:

$$c_{k_t} = \sum_{j=1}^{L_F} \rho_{tj} f_j, \qquad \rho_{tj} = \begin{cases} \frac{1}{K}\, p_e(f_j \mid \cdot) & \text{if } \mathrm{object}(f_j) = y_t \\ 0 & \text{otherwise} \end{cases}$$

where $\mathrm{object}(f)$ indicates the "object" part of fact $f$, and $K$ is
the normalization term, equal to
$\sum_{j' : \mathrm{object}(f_{j'}) = y_t} p_e(f_{j'} \mid \cdot)$, which
accounts for multiple positions matching $y_t$ in the external facts.
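
The normalized weights used in the state update can be computed as in the sketch below: only facts whose object equals the generated word receive weight, and those weights are rescaled to sum to one. The names `fact_weights` and `knowledge_context` are hypothetical, the fact probabilities are taken as given, and the handling of the case where no fact matches (all weights zero) is an assumption of this example.

```python
def fact_weights(p_fact, fact_objects, y):
    """Weight of fact j: p_fact[j] / K if its object equals y, otherwise 0.

    K is the total probability of all facts whose object equals y.
    """
    K = sum(p for p, o in zip(p_fact, fact_objects) if o == y)
    if K == 0.0:
        return [0.0] * len(p_fact)  # assumption: no matching fact gives a zero context
    return [p / K if o == y else 0.0 for p, o in zip(p_fact, fact_objects)]


def knowledge_context(p_fact, fact_objects, fact_vectors, y):
    """Weighted sum of the fact representations that match the word y."""
    rho = fact_weights(p_fact, fact_objects, y)
    dim = len(fact_vectors[0])
    return [sum(r * f[d] for r, f in zip(rho, fact_vectors)) for d in range(dim)]


rho = fact_weights([0.2, 0.6, 0.2], ["Yan Fu", "Wu Chenen", "Yan Fu"], "Yan Fu")
ctx = knowledge_context([0.2, 0.6, 0.2], ["Yan Fu", "Wu Chenen", "Yan Fu"],
                        [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]], "Yan Fu")
```

In this example two facts share the object "Yan Fu", so each receives half of the weight and the context vector is the average of their representations.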


4 Experiments 


4.1 Dataset 


For our experimental data, we used the data set provided by He et al. [6]. In
addition, we crawled the corresponding text facts from Baidu Baike (a Chinese
encyclopedia website). In our work, all "subject" and "object" entities of the
triples are used as encyclopedia entries, and we crawled all articles related
to these entries. The Chinese texts in the data were converted into sequences
of words using the Jieba Chinese word segmenter, and all related text facts
were then extracted through keyword matching with the KB triples. After
extracting all the relevant facts from the articles, we used the facts from
text and KB to establish the universal schema.


4.2 Model 


First, we use the seq2seq model with attention (seq2seq+atten), which is widely
used in chit-chat dialogue systems, as one of our baselines. We also use
generative QA models (GenQA [18] and COREQA [6]), which can be applied in
knowledge-grounded conversation, as baselines. Finally, we apply our model and
compare three types of external knowledge source, which comprise only KB, only
text, and the universal schema containing both text and KB, respectively.


4.3 Evaluation Metrics 


We have compared our model with baselines by both automatic evaluation and 
human evaluation. 


Automatic Evaluation. Following existing work, we employ BLEU [9] for
automatic evaluation, which reflects the word overlap between the ground
truth and the generated response. To measure information correctness,
we also evaluate the models in terms of accuracy. As in COREQA [6],
we present the results separately according to the number of facts
a question needs from the knowledge base: just one
single fact (marked as Single), multiple facts (marked as Multi), and all questions
(marked as Mixed). We randomly selected 5120 samples from the data set as our
test set; the results are shown in Table 2.


Human Evaluation. We also recruit human annotators to judge the quality of
the generated responses with respect to Fluency, Correctness and Grammar. All
scores range from 1 to 5; a higher score represents better performance on
the respective metric. For the human evaluation, we randomly
selected 300 samples from our test set; the results are shown in Table 3.


4.4 Results 


Table 2 shows the accuracies of the models on the test set. Our model with the
universal schema (text&kb) outperforms all baseline models, with an accuracy of
65.4 on single-fact and 52.7 on multi-fact questions, so it can generate correct
answers that need a single fact or multiple facts. This indicates that using KB and
text together as external knowledge helps to generate more accurate and more
contentful natural answers.

Table 2. The result of automatic evaluation on test data.


Models              | BLEU | Single | Multi | Mixed
seq2seq+atten       | 0.39 | 20.1   | 3.5   | 19.4
GenQA               | 0.38 | 47.2   | 28.9  | 45.1
COREQA              | -    | 58.4   | 42.7  | 56.6
Our model (kb)      | 0.42 | 56.2   | 45.9  | 54.7
Our model (text)    | 0.45 | 47.2   | 42.9  | 45.9
Our model (text&kb) | 0.43 | 65.4   | 52.7  | 63.6


Table 3. The result of human evaluation on test data.


Models              | Fluency | Correctness | Grammar
seq2seq+atten       | 3.67    | 2.34        | 3.93
GenQA               | 3.56    | 3.39        | 3.73
Our model (text&kb) | 4.12    | 4.42        | 4.19


As shown in Table 3, our framework outperforms the baseline models in the
human evaluation. The largest improvement is in correctness (4.42 versus 3.39
for GenQA), indicating that our model generates more accurate answers.


5 Conclusion and Future Work 


In this paper, we propose an end-to-end generative question answering system to
generate natural answers containing an arbitrary number of knowledge entities. We
establish a universal schema as a large external fact library using unstructured
text and structured KB. The experimental results show that our model can generate
more natural and fluent answers and that the universal schema is a more promising
knowledge source for generating natural answers than KB or text alone. However,
when extracting related text facts from raw text through keyword matching with
KB triples, a lot of useful text data was discarded. In the future, we plan to
explore ways to more effectively combine structured and unstructured knowledge
with a fuller use of text.
Abstract. Learning in non-stationary environments is challenging,
because under such conditions the common assumption of independent 
and identically distributed data does not hold; when concept drift is 
present it necessitates continuous system updates. In recent years, sev- 
eral powerful approaches have been proposed. However, these models 
typically classify any input, regardless of their confidence in the classi- 
fication — a strategy, which is not optimal, particularly in safety-critical 
environments where alternatives to a (possibly unclear) decision exist, 
such as additional tests or a short delay of the decision. Formally speak- 
ing, this alternative corresponds to classification with rejection, a strat- 
egy which seems particularly promising in the context of concept drift, 
i.e. the occurrence of situations where the current model is wrong due 
to a concept change. In this contribution, we propose to extend learn- 
ing under concept drift with rejection. Specifically, we extend two recent 
learning architectures for drift, the self-adjusting memory architecture 
(SAM-kNN) and adaptive random forests (ARF), to incorporate a reject 
option, resulting in highly competitive state-of-the-art technologies. We 
evaluate their performance in learning scenarios with different types of 
drift. 


Keywords: Rejection · Reject option ·
Learning in non-stationary environments · Concept drift


1 Introduction 


Machine learning (ML) increasingly permeates our daily lives in the form of 
intelligent household devices, robot companions, autonomous driving, intelligent 
decision support systems, fraud prevention, etc. Although ML models are getting 
ever more reliable — in particular due to increasing data volumes for training — 
they do not achieve 100% accuracy since they rely on statistical inference. Usu- 
ally, there exist situations where ML models fail and provide invalid results. 


This work was supported by Honda Research Institute Europe GmbH, Offenbach 
Because users of a model struggle to interpret its abilities and limitations correctly
[1], such failures have a measurable impact on the user’s trust [2]; hence,
failures should be avoided not only in safety critical environments where failures 
could be fatal, but also in everyday applications in order to improve user accep- 
tance. In the case of agent models (e.g. robots), failures can often be observed 
easily from the agent’s state (e.g. a robot not reaching its prescribed goal), and 
the challenge is how to communicate the cause of failure [3]. In stark contrast, 
failures can remain unobserved for classification models since most classifiers 
do not provide an explicit notion of their domain of validity. Hence the chal- 
lenge arises how to enhance classifiers with an explicit notion when to reject a 
classification. 

The notion of classification with a reject option explicitly takes into account 
the possibility to reject a classification in unclear cases. Pioneered by Chow [4], 
who derived optimal reject rules if true class probabilities are known, a num- 
ber of extensions of learning with reject options have been proposed for batch 
learning scenarios, such as plugin rules for class probabilities [5], efficient sur- 
rogate losses [6,7], or optimal combination schemes of local rejection [8]. These 
approaches deal with the classical setting of batch training based on i.i.d. data. 
A minor extension is offered by so-called conformal prediction, a framework
which allows assigning probabilities to classification decisions for single inputs
and, consequently, rejecting classifications based on those values [9]. Here, the
weaker condition of exchangeability is posed, opening the floor to online learning
scenarios, but not yet to concept drift [10].

A number of approaches have been proposed for learning in non-stationary 
environments in the presence of concept drift, whereby several recent technolo- 
gies are also suited for heterogeneous types of drift [10-14]. Generally speaking, 
concept drift is present whenever the underlying input distribution or class pos- 
terior changes, which is the case when sensors are subject to fatigue, novel and 
previously unseen data is observed over time, class concepts such as opinions 
develop over time, settings are subject to seasonal changes, etc. When learning 
with drift, it is almost inevitable to encounter domains of uncertain classifica- 
tion — otherwise, it would not be necessary to further adapt the classification 
mapping, contradicting the idea of drift. Nevertheless, most learning models for 
non-stationary environments do not incorporate reject options. The only notable 
exception is the Droplets algorithm [15], which assigns some inputs explicitly to 
the class “reject”; size and shape of this class depend on (fixed) model meta- 
parameters for training. A scalable reject threshold based on the required level 
of certainty or user acceptance is not induced by this model. 

In this contribution, we aim for an enhancement of models for learning with 
drift by a reject option which is based on a classifier-specific certainty measure 
of the classification. To the best of our knowledge, this contribution constitutes 
the first attempt to extend learning with drift to include rejection in such a 
way. The overall design implies that a suitable reject threshold can be cho- 
sen in applications. We investigate rejection for an online perceptron learning 
algorithm, demonstrating the complexity of the task. Afterwards, we propose 
a reject option for two techniques, the self-adjusting memory model and adaptive
random forests, achieving convincing results. We demonstrate the benefit
of learning with rejection on several benchmarks which incorporate different
types of drift.


2 Learning with a Reject Option 


A given classifier provides a mapping $f: \mathbb{R}^n \to \{1, \dots, N\}$ of real-valued data to
$N$ classes. Classification with reject option extends such functionality by a special
output class $o$, which indicates that the classifier abstains from making a decision.
This option is beneficial whenever the probability of a misclassification is higher
than the costs for a reject. In practice, many classifiers are equipped with a
certainty measure $c: \mathbb{R}^n \to \mathbb{R}$ which indicates the certainty of the classification,
e.g. the (signed) distance to the decision boundary. In such cases, a reject strategy
is often based on a simple threshold $\theta$, i.e. the classification is of the form

$$f_\theta(x) = \begin{cases} f(x) & \text{if } c(x) \ge \theta \\ o & \text{otherwise.} \end{cases} \qquad (1)$$


Provided c(x) is the class probability of the output class f(x), this strategy is 
optimal [4]. For many popular classification methods, certainty measures c exist 
which empirically lead to excellent results [8]. 
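
The threshold rule of Eq. (1) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `REJECT` and `classify_with_reject` and the toy classifier and certainty measure are assumptions made for this example.

```python
REJECT = "reject"  # stand-in for the special output class o (name is an assumption)


def classify_with_reject(f, c, x, theta):
    """Return f(x) if the certainty c(x) reaches theta, otherwise REJECT."""
    return f(x) if c(x) >= theta else REJECT


# Hypothetical 1-D toy setup: class 1 for x >= 0, certainty grows with |x| up to 1.
f = lambda x: 1 if x >= 0 else 0
c = lambda x: min(abs(x), 1.0)

confident = classify_with_reject(f, c, 0.8, theta=0.5)   # far from the boundary
unclear = classify_with_reject(f, c, -0.1, theta=0.5)    # close to the boundary
```

With `theta = 0` every input is classified, so the rule reduces to the plain classifier; raising `theta` trades coverage for accuracy.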


2.1 Classifiers 


In addition to a linear model as an initial baseline, we address an ensemble 
of k-NN classifiers and random forests, respectively — more complex machine 
learning technologies that yield state-of-the-art results. For these algorithms, 
the following certainty measures have been proposed: 


Linear Classifier: One of the first models which has been enhanced with a reject
option is the classical linear classifier. For two classes (0 and 1), a linear classifier
provides the classification $f(x) = H(w^\top x - \theta)$ with the Heaviside function $H$,
an adjustable weight vector $w \in \mathbb{R}^n$ and bias $\theta$. A typical confidence measure
is offered by $c(x) = \mathrm{sgd}(w^\top x - \theta)$ with the sigmoidal $\mathrm{sgd}(t) = 1/(1 + \exp(-t))$
for class 1 and $1 - \mathrm{sgd}(w^\top x - \theta)$ for class 0. This measure correlates with the
distance of the data point $x$ to the decision boundary. It has been demonstrated
by Platt [16] that this form usually yields reasonable confidence measures, where
slope and offset of the sigmoidal function are typically optimized based on the
given data to enable an optimum match of its range to true confidence values.
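
As a small illustration of this confidence measure, the sketch below computes the label and the sigmoidal certainty for a fixed weight vector. The function names and the convention that an activation of exactly zero is assigned to class 1 are assumptions of this example; no Platt-style rescaling of slope and offset is applied.

```python
import math


def sgd(t):
    """Logistic sigmoid 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def linear_classify(w, theta, x):
    """Return (label, certainty) for the linear classifier H(w^T x - theta)."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) - theta
    p1 = sgd(activation)                      # certainty for class 1
    label = 1 if activation >= 0 else 0       # Heaviside step (0 mapped to class 1)
    certainty = p1 if label == 1 else 1.0 - p1
    return label, certainty


near = linear_classify([1.0, 0.0], 0.0, [0.1, 3.0])   # close to the boundary
far = linear_classify([1.0, 0.0], 0.0, [-4.0, 3.0])   # far on the class-0 side
```

The certainty of the predicted class is always at least 0.5 and grows with the distance to the decision boundary.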


k-NN classifier: Assume a point $x$ is given with its $k$ nearest neighbors $x_1, \dots, x_k$
and corresponding labels $y_1, \dots, y_k$. For the simple k-NN we could rely on the
fraction of points of the same label within the $k$ nearest neighbors [17]. However,
this measure has the drawback that it provides $k + 1$ discrete values only.


Mitigating Concept Drift via Rejection 459 


A continuous extension can be based on formal grounds such as Dempster-Shafer
theory [18], but this would require the tuning of several meta-parameters, rendering
this measure unsuitable for online learning. Here, we rely on weighted
k-NN classification instead:


$$f(x) = \arg\max_{j} \sum_{i=1}^{k} \frac{I(y_i, j)}{d(x, x_i)} \qquad (2)$$


where $d(x, x_i)$ is the (Euclidean) distance¹ between $x$ and $x_i$, and


$$I(y_i, j) = \begin{cases} 1, & y_i = j, \\ 0, & \text{otherwise.} \end{cases} \qquad (3)$$

Delany et al. [19] investigate several certainty measures and propose an accumulation
of several criteria that take into account distances to closest neighbors
of the same class and different classes, respectively. We approximate this value by
an efficient surrogate function which can be directly derived from the weighted
k-NN classification rule, the normalized average distance with values in [0, 1]:


$$c(x) = \frac{\sum_{i=1}^{k} I(y_i, f(x)) / d(x, x_i)}{\sum_{j=1}^{N} \sum_{i=1}^{k} I(y_i, j) / d(x, x_i)} \qquad (4)$$
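
The following sketch shows one way to compute a weighted k-NN decision together with a normalized certainty in [0, 1]. It assumes inverse-distance weights and takes the winning class's share of the total weight as certainty; the constant `EPS` and the function name are hypothetical, and the exact weighting used by the authors may differ.

```python
import math
from collections import defaultdict

EPS = 1e-9  # hypothetical floor substituted for (near-)zero distances


def weighted_knn(train, x, k):
    """train: list of (point, label) pairs. Returns (label, certainty in [0, 1])."""
    # Brute-force neighbor search; fine for a sketch, too slow for large windows.
    neighbors = sorted(train, key=lambda s: math.dist(s[0], x))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        votes[label] += 1.0 / max(math.dist(point, x), EPS)
    winner = max(votes, key=votes.get)
    # Certainty: inverse-distance weight of the winner relative to all weights.
    return winner, votes[winner] / sum(votes.values())


train = [((0.0, 0.0), "a"), ((0.0, 1.0), "a"), ((5.0, 5.0), "b")]
uniform = weighted_knn(train, (0.0, 0.2), k=2)   # both neighbors share a label
mixed = weighted_knn(train, (0.0, 0.2), k=3)     # a distant "b" lowers certainty
```

A uniformly labeled neighborhood always yields certainty 1, regardless of how far away the neighbors are.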


Random forests: Random forests as introduced by Breiman [20] constitute one of 
the current state-of-the-art classifiers [21], offering a classification as an ensemble 
of decision trees. Typically, decision trees are grown iteratively from the training 
data (or bootstrap samples thereof in the case of random forests), and every 
leaf is assigned a class probability distribution in terms of the relative frequency 
of the labels of the training samples assigned to this leaf. This probability can 
directly be interpreted as a certainty measure, but it is subject to large variance 
for single trees. This is greatly diminished when averaging over a bootstrap 
sample, as present in random forests. It has been investigated experimentally by 
Niculescu-Mizil and Caruana [22] that the resulting values strongly correlate to 
the true underlying class probabilities, hence we will use this certainty measure 
in the case of random forests. Its values lie within the range [0,1]. 
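
A minimal sketch of this certainty measure, assuming each tree reports the class distribution of the leaf that a sample falls into as a dictionary. The function name and the optional per-tree weights are assumptions of this example; no particular forest library is implied.

```python
def forest_certainty(leaf_distributions, weights=None):
    """Average per-tree leaf class distributions and return (label, certainty)."""
    n_trees = len(leaf_distributions)
    weights = weights or [1.0] * n_trees          # plain average by default
    total = sum(weights)
    classes = set().union(*leaf_distributions)    # all labels seen in any leaf
    averaged = {
        cls: sum(w * d.get(cls, 0.0) for w, d in zip(weights, leaf_distributions)) / total
        for cls in classes
    }
    label = max(averaged, key=averaged.get)
    return label, averaged[label]


# Hypothetical leaf distributions of three trees for one sample.
leaves = [{"a": 0.9, "b": 0.1}, {"a": 0.6, "b": 0.4}, {"a": 0.3, "b": 0.7}]
label, certainty = forest_certainty(leaves)
```

Averaging smooths out the high variance of a single tree's leaf frequencies; non-uniform weights give a weighted average over the ensemble.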


2.2 Evaluation Measure 


Based on the underlying class probabilities, one could obtain optimal reject 
strategies, but they are not known in practical applications. Good certainty mea- 
sures typically strongly correlate with said probabilities, although their precise 
values differ [22]. An optimal choice of the threshold is often problem-dependent, 
reflecting the desired balance of the number of rejected data points versus the 
accuracy for the remaining data. As such, it is common practice to compare
the efficiency of classification with a reject option by means of the so-called
accuracy-reject curve: sampling certainty thresholds $\theta \in [0, 1]$, we report
the accuracy of the classification method for all points that are not rejected
(i.e. accepted) using this threshold, together with the ratio of points that are
accepted [23].


¹ We substitute a small $\epsilon > 0$ for $d(x, x_i)$ if $d(x, x_i) < \epsilon$.
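
An accuracy-reject curve as described above can be computed from per-sample certainties and correctness flags. The sketch below is illustrative; the function name and the toy values are assumptions.

```python
def accuracy_reject_curve(certainties, correct, thresholds):
    """Return a list of (acceptance ratio, accuracy on accepted points)."""
    curve = []
    for theta in thresholds:
        accepted = [ok for cert, ok in zip(certainties, correct) if cert >= theta]
        if accepted:  # accuracy is undefined once every point is rejected
            curve.append((len(accepted) / len(correct), sum(accepted) / len(accepted)))
    return curve


# Hypothetical certainties and whether each prediction was correct.
certainties = [0.95, 0.9, 0.6, 0.55, 0.4]
correct = [True, True, False, True, False]
curve = accuracy_reject_curve(certainties, correct, [0.0, 0.5, 0.8])
```

In this toy example the accuracy rises from 0.6 to 0.75 to 1.0 as the acceptance ratio drops from 1.0 to 0.8 to 0.4, which is the behavior a useful certainty measure should show.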


3 Learning with Concept Drift and Its Extension 
to Rejection 


In online learning, a potentially infinite stream $(\dots, (x_t, y_t), (x_{t+1}, y_{t+1}), \dots)$ of
training data is given, where $t$ denotes the current time, and each sample $(x_t, y_t)$
is generated from an unknown probability distribution $p_t$. The presence of drift
refers to the fact that $p_t(x, y)$ changes over time, i.e. at least two time points
$t_1$ and $t_2$ exist such that $p_{t_1}(x, y) \neq p_{t_2}(x, y)$. If the posterior class probabilities
change, $p_{t_1}(y \mid x) \neq p_{t_2}(y \mid x)$, we call this real concept drift; if only the input
distribution changes, $p_{t_1}(x) \neq p_{t_2}(x)$, this is referred to as virtual concept drift
or covariate shift. In particular for real concept drift, a static classifier is often
suboptimal, and the goal is to evolve a classification mapping $h_t$ over time, which
adjusts to the current class posterior distribution, whereby $h_{t+1}$ is inferred from
$h_t$ and the current sample $(x_t, y_t)$ only. The objective is to minimize the average
misclassification over time as measured, for example, by the so-called interleaved
test-train error for a time period $T$:


$$E = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\left(h_t(x_t) \neq y_t\right) \qquad (5)$$


This setting can be extended to online learning with rejection as soon as
the classification mapping $f_t$ is accompanied by a certainty measure $c_t$. In this
case, given a threshold $\theta$, classification at time point $t$ is rejected if, and only if,
$c_t(x_t) < \theta$. Evaluation takes place by reporting the modified interleaved test-train
error


$$E_\theta = \frac{\sum_{t \le T:\, c_t(x_t) \ge \theta} \mathbb{1}\left(f_t(x_t) \neq y_t\right)}{\left|\{t \le T : c_t(x_t) \ge \theta\}\right|} \qquad (6)$$


and the ratio of classified data points


$$\frac{\left|\{t \le T : c_t(x_t) \ge \theta\}\right|}{T} \qquad (7)$$
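
The interleaved test-train protocol with rejection can be sketched as a single loop: each sample is first used for testing and only afterwards for training. The interface (`predict` returning a label and a certainty, `learn` updating the model) and the toy `MajorityModel` are assumptions made for this example.

```python
class MajorityModel:
    """Hypothetical toy learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None, 0.0
        label = max(self.counts, key=self.counts.get)
        return label, self.counts[label] / sum(self.counts.values())

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


def interleaved_test_train(stream, model, theta):
    """Return (error on accepted points, ratio of accepted points)."""
    total = accepted = errors = 0
    for x, y in stream:
        label, certainty = model.predict(x)   # test first ...
        if certainty >= theta:                # accepted, i.e. not rejected
            accepted += 1
            errors += int(label != y)
        model.learn(x, y)                     # ... then train on the same sample
        total += 1
    error = errors / accepted if accepted else float("nan")
    ratio = accepted / total if total else 0.0
    return error, ratio


stream = [(0, "a"), (0, "a"), (0, "b"), (0, "a")]
error, ratio = interleaved_test_train(stream, MajorityModel(), theta=0.6)
```

Here the first sample is rejected because the untrained model has certainty 0, so three of four samples are classified and one of them is wrong.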


A number of learning models have been proposed which are capable of dealing 
with drift [10-14]. We address two recent models (SAM and ARF) which are 
suited for heterogeneous drift and which can be naturally extended to include 
a reject option. For comparison, we look at a linear classifier (perceptron) that 
can adapt to drift but where useful reject strategies are problematic, as well as 
two sliding windows to serve as a baseline. 
Online perceptron: One simple yet popular method, which is also available
in stream mining suites such as the massive online analysis toolbox for
data streams, is online perceptron learning [24]. Essentially, this consists of an
online gradient descent on the squared error of the perceptron activation function
$\mathrm{sgd}(w^\top x - \theta)$ based on given data with fixed step size. This model is naturally
restricted to linear settings, yet it yields surprisingly accurate behavior in an initial
demonstration scenario as we will show in an experiment, a behavior which
has also been substantiated analytically [25]. Yet, for online settings, it is not
possible to adjust the sigmoidal rescaling of the perceptron output as proposed
by [16], hence we will directly rely on the measure $c(x)$ for rejection as introduced
above.
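
A minimal sketch of such an online perceptron, assuming the squared error $\frac{1}{2}(\mathrm{sgd}(w^\top x - \theta) - y)^2$ and a fixed step size. The class name, the step size of 0.5, and the zero initialization are assumptions of this example and do not reproduce any specific toolbox implementation.

```python
import math


class OnlinePerceptron:
    def __init__(self, dim, step=0.5):
        self.w = [0.0] * dim
        self.theta = 0.0
        self.step = step  # fixed step size (value chosen for illustration)

    def _p1(self, x):
        a = sum(wi * xi for wi, xi in zip(self.w, x)) - self.theta
        return 1.0 / (1.0 + math.exp(-a))

    def predict(self, x):
        """Return (label, certainty) using the sigmoidal output directly."""
        p = self._p1(x)
        return (1, p) if p >= 0.5 else (0, 1.0 - p)

    def learn(self, x, y):
        p = self._p1(x)
        g = (p - y) * p * (1.0 - p)  # derivative of 0.5 * (p - y)^2 w.r.t. activation
        self.w = [wi - self.step * g * xi for wi, xi in zip(self.w, x)]
        self.theta += self.step * g  # activation depends on -theta, so the sign flips


model = OnlinePerceptron(dim=1)
for _ in range(200):
    model.learn([1.0], 1)
    model.learn([-1.0], 0)
```

Because every sample triggers one gradient step, the model keeps adapting when the data distribution changes, which is what makes it usable under drift.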


SAM-kNN: The Self-Adjusting Memory (SAM) architecture [26] keeps two com- 
plementary memories — short-term and long-term. The former contains the most 
recent samples of a data stream, whereby the length of this window is adjusted 
based on the classification performance, while the latter stores and continuously 
refines a compacted representation of previous samples as long as these are con- 
sistent with the short term memory. Depending on how the data stream changes, 
SAM makes flexible use of its two memories and a weighted k-nearest neighbors 
classifier to accurately classify even when drift is present. We extend the output 
of the classifier by the certainty measure as introduced above as the basis for a 
reject option. 


ARF: Adaptive random forests (ARF) [14] constitute a state-of-the-art ensem- 
ble method for learning with drift. Random forests grow very fast decision trees 
(Hoeffding trees) online based on Poisson sampling to mimic bootstrapping 
effects. ARF wraps this technology into an active drift detection loop, which 
assigns suitable weights to an ensemble of trees, replaces unsuitable trees if drift 
is observed, and grows trees in the background that can serve as an intelligent 
initialization of such replacements when drift is expected. We can use the cer- 
tainty measure as introduced for random forests above and extend it to weighted 
averages over the ensemble of trees as a basis for rejection. 


Sliding window: Techniques which use a classifier based on a sliding window 
of the data stream can serve as a baseline. We will consider a weighted k-NN 
classifier with a sliding window of fixed size (referred to as fixed window) as well 
as a window whose size is adapted based on the optimum classification error 
such as the short term memory in SAM (referred to as adaptive window). 


4 Experiments 


4.1 Linear Setting 


Fig. 1. Accuracy and reject ratio over time according to different reject thresholds for
the perceptron and the fixed window. Thresholds are chosen such that 100%, 90% and
80% of points are accepted. Accuracy and ratio are calculated over a sliding window
that contains 10% of the dataset's total number of samples. Between samples 37500
and 62500 the data-generating rectangles move through one another, making the
classification more difficult.


Initially, we investigate how a perceptron’s certainty responds to concept drift
and demonstrate that it is not easily augmented with a reject option. To that
end, we create a 2-dimensional dataset with two classes. Points are sampled uniformly
from two rectangles (which determine the class label) that move towards,
through, and apart from one another over time. The two classes are initially linearly
separable, then become indistinguishable, and eventually become linearly
separable again, albeit with a flipped separating hyperplane. To add noise, we
flip the class label of every 7th sample.
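
A data stream of this kind can be generated along the following lines. The rectangle size, the start offset, the linear motion, and the random class choice per sample are assumptions of this sketch; only the overall construction (two rectangles passing through one another, every 7th label flipped) follows the description above.

```python
import random


def moving_rectangles(n_samples, width=1.0, height=1.0, start=2.0, seed=0):
    """Yield (x, label): two rectangles swap horizontal positions over the stream."""
    rng = random.Random(seed)
    for t in range(n_samples):
        progress = t / max(n_samples - 1, 1)           # runs from 0 to 1
        label = rng.randrange(2)
        # Class 0 travels from -start to +start, class 1 the opposite way.
        center = (-start + 2 * start * progress) * (1 if label == 0 else -1)
        x = (center + rng.uniform(-width / 2, width / 2),
             rng.uniform(-height / 2, height / 2))
        if (t + 1) % 7 == 0:                           # label noise on every 7th sample
            label = 1 - label
        yield x, label


data = list(moving_rectangles(70))
```

At the start and at the end of the stream the sign of the first coordinate separates the classes (with opposite orientation), while around the middle both rectangles overlap.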

For comparison, the data is used to evaluate a fixed window² as well as the
perceptron. The results are presented in Fig. 1, with different certainty thresholds
that correspond to 100%, 90% and 80% of accepted (classified) points. It
is apparent that the simple online perceptron is surprisingly accurate for this 
data set, despite its rather simple learning rule. As expected, an increasing clas- 
sification difficulty is reflected in a decrease in accuracy. When the rectangles 
move apart (and the classes become linearly separable again), both algorithms 
recover. However, it is apparent that the perceptron hardly benefits from a reject 
option, whereas the fixed window clearly does, rejecting more points the more 
difficult the problem is and in such a way that the accuracy increases. Hence, it 
is a nontrivial task to identify effective rejection for learning with drift. 


4.2 General Setting 


We evaluate the efficiency of classification with rejection on a number of benchmark
datasets with nonlinear characteristics. Here, model meta-parameters are
chosen in the same way as reported in Losing et al. [26] and Gomes et al. [14]. We
determine accuracy-reject curves by dividing the range of observed certainties
into equally sized intervals and deriving the respective Pareto-optimal
accuracy-reject pairs. For reporting, we focus on the practically interesting range of 100%
to 50% of accepted points. We consider the benchmark datasets as described in
Losing et al. [12,26], since they cover a wide variety of different data and drift
characteristics. See Table 1 for an overview of the datasets.


² The fixed window serves as a straightforward example. Results for the adaptive
window, SAM, and ARF are comparable; the largest difference in accuracy between
all four is below 2%.


Table 1. Datasets considered for our experiments. Real-world datasets are followed by
artificial datasets; other than that, they are presented in no particular order. Drift
properties are given according to Losing et al. [12].


Dataset              | # Samples | # Features | # Classes | Drift
Outdoor objects      | 4000      | 21         | 40        | Virtual
Rialto bridge        | 82250     | 27         | 10        | Virtual
Poker hand           | 829201    | 10         | 10        | Virtual
Electricity          | 45312     | 6          | 2         | Real
Weather              | 18159     | 8          | 2         | Virtual
Transient chessboard | 200000    | 2          | 8         | Virtual
Rotating hyperplane  | 200000    | 10         | 2         | Real
Interchanging RBF    | 200000    | 2          | 15        | Real
Mixed drift          | 600000    | 2          | 15        | Real
Moving RBF           | 200000    | 10         | 5         | Real
Moving squares       | 200000    | 2          | 4         | Real
SEA concepts         | 50000     | 3          | 2         | Real


Effectiveness of Reject. The resulting accuracy-reject curves with respect
to different certainty thresholds for all twelve datasets and all four classifiers
are presented in Fig. 2. As reported by Losing et al. [26] and Gomes et al. [14],
it is apparent that the methods SAM and ARF are robust classifiers capable
of dealing with drift, with SAM performing consistently well across all datasets
considered, while ARF shows excellent results on most, but not all, datasets (the
exceptions being Outdoor Objects, Moving RBF, and Moving Squares). Surprisingly,
the baselines also yield acceptable results for certain datasets. We observe that
rejection increases the classifiers’ accuracy consistently for all datasets and
influences all methods similarly: averaged over all datasets and all four classifiers,
rejecting 10% or 20% of all samples leads to an increase in accuracy by 3.19% or 5.64%,
respectively. The smallest increase is 1.06% or 1.17%, and the highest increase is
5.43% or 10.07%, respectively.

Fig. 2. Accuracy-reject curves for all datasets considered. Note the different vertical
axes. The nearest neighbor based classifiers classify many samples with maximal
certainty, which explains why the respective curves often terminate early.


At present, we have used certainty measures that are intuitive and fast to
compute in all cases. The curves indicate one possible weakness of these measures:
in particular for k-NN classifiers (including SAM), the accuracy does not
reach 100%; rather, the curves end prematurely. This is due to the fact that
k-NN assigns a certainty of 100% to a great number of points since their
k-neighborhoods are uniformly labeled. More elaborate certainty measures, such as
a reject option based on absolute distances that respects outliers, or an extension
of the method to ensembles with correspondingly averaged certainties, could enable
a “subtler” assessment of certainty. Hence, we see room for further improvement
beyond the already satisfactory results.


Temporal Behavior. As expected, accuracy varies over time in the presence 
of non-homogeneous drift in real-life datasets. For Outdoor Objects and Rialto 
Bridge we show this together with how accuracy is affected by rejection, and 
how the rejection ratio varies over time, when SAM with a reject option is 
used to classify (Fig. 3). Interestingly, the sharp drop in accuracy at samples 
1700 to 2200 from Outdoor Objects is mirrored in the ratio of rejected points 
as a pronounced peak in rejected points. In this case, increasing the number of 
rejected points allows the classifier to improve so much that no notable drop in 
accuracy remains. 
Fig. 3. Accuracy and reject ratio over time according to different reject thresholds
for SAM, shown for the datasets Outdoor Objects and Rialto Bridge. Thresholds are 
chosen such that 100%, 90% and 80% of points are accepted. Accuracy and ratio are 
calculated over a sliding window that contains 10% of the respective dataset’s total 
number of samples. Note the abrupt, temporary drop in accuracy for Outdoor Objects 
and the corresponding increase in the number of rejected points for samples 1700 to 
2200. 


A similar — albeit less pronounced — behavior can be observed for Rialto 
Bridge. Here the overall variation in accuracy becomes much narrower. For sam- 
ples 30 000 to 40 000 the abrupt loss in accuracy is compensated by more rejected 
points. 


5 Discussion 


We have introduced and evaluated diverse online learning classifiers with reject 
options in the presence of concept drift. Across all datasets and classifiers, we 
see a notable increase in accuracy when using a reject option for k-NN classifiers 
and ensembles of random forests. In stark contrast, rejection as presented for the
perceptron does not seem easily extendable to the setting of concept drift; within
an initial linear setting, rejection did not show any benefits. As expected, also
for the real-life non-linear data sets, no classifier achieves 100% accuracy in the
presence of drift. Interestingly, although techniques such as SAM-kNN are
consistently good for all settings, there is not one clear winner among the classifiers
as they perform differently on various datasets. This is in line with the findings
of Losing et al. [26].

Rejecting with respect to a fixed certainty threshold does not merely increase 
the accuracy overall but can specifically alleviate low accuracy that stems from 
low certainty, as seen in Fig. 3. It remains to be seen how more sophisticated, 
time- and drift-dependent strategies for dynamically choosing certainty thresh- 
olds can improve performance even further. 

Considering the particular structure of SAM, where classification depends 
on a choice between long- and short-term memory, it might prove beneficial to 
incorporate their certainties into the decision-making process; so far, it has
depended solely on the memories’ past performances. One must carefully investigate,
however, how far a classifier’s certainty can be trusted, especially when the
classifier performs badly in the presence of drift. On the other hand, samples with
low certainty could indicate areas in which the model needs to be augmented.

As mentioned earlier, incorrect classification results can negatively impact a
user’s trust in a system. Because it leads to a higher accuracy, rejection alleviates
these issues, but it remains important to determine how to communicate to a user why
a point is rejected or, more generally, with how high a certainty a point is
classified and how that certainty is to be interpreted.
Abstract. Some strategies used by the insect olfactory system to enhance
its discrimination capability are a heterogeneous neural threshold distribution,
gain control and sparse activity. To test the influence of these
mechanisms on the performance in a classification task, we propose
a neural network based on the insect olfactory system. In this model,
we introduce a regulation term to control the activity of neurons and
a structured connectivity between antennal lobe and mushroom body
based on recent findings in Drosophila that differs from the classical
stochastic approach. Results show that the model achieves better
results for high sparseness and low connectivity between Kenyon cells
and projection neurons. For this configuration, the use of gain control
further improves performance. The structured connectivity model proposed
is able to achieve the same discrimination capacity without using
gain control or activity regulation techniques, which opens up interesting
possibilities.


Keywords: Neural computation · Pattern recognition ·
Bio-inspired neural networks · Neural threshold · Sparse coding ·
Olfactory system


1 Introduction 


In this paper, we focus on different strategies that are used in the
olfactory system of insects for stimulus recognition and how they can be applied
to improve the performance of artificial neural networks. The insect olfactory system
is one of the most studied biological neural networks since it is less complex
than the vertebrate olfactory system, so many details about its structure
and the function of its different neural populations are known [7]. The insect
olfactory system is organized in layers as follows: olfactory receptor neurons (ORNs)
expressing different receptors capture the information of odorants and pass it to
the projection neurons (PNs) in the antennal lobe (AL). PNs encode odorant
information through oscillations and activity sequences which are then sent to
the Kenyon cells (KCs) in the mushroom body (MB). It is in the MB where
stimulus identification and learning take place. KCs make use of sparse coding to
improve the separability of patterns, showing little activity and no spike response
for most odorants [5,11]. Finally, the output from KCs goes to the MB output
neurons (MBONs), responsible for the final identification of odorants. To achieve
such a great discrimination capability between stimuli, the biological system uses
several strategies; those that our model includes are the following:


i. Heterogeneous threshold distribution in KCs: it has been found that a
heterogeneous distribution of neural thresholds among KCs enhances pattern
classification compared to the same model using a homogeneous distribution [8-10].

ii. Gain control: a mechanism for gain control at the AL level is crucial to provide
a standard representation of a stimulus regardless of its concentration,
even when it is extremely high [14].

iii. Sparse coding in the MB: as stated before, KCs show low activity since
each one of them only responds to a small set of stimuli [11]. This sparse
activity in combination with the fan-out phase between the AL and MB
ensures an internal representation of patterns that prevents the occurrence
of overlaps and, therefore, enhances pattern recognition while also ensuring
energetic efficiency [3].


Although the insect olfactory system has been extensively studied, there is
still controversy about the values that some basic parameters of the PN-KC
connectivity may take and about the topology that these connections follow.
One of these parameters is the connection probability p_c between PNs and KCs.
For example, in the case of the locust, some authors have suggested that
p_c = 0.5 [6], while others argue that it would be p_c = 0.1 [11] or even lower
than that [3].

Regarding connection topology, as the PN-KC connections are not learned
and cannot be reproduced across individuals, they are usually modeled using a
stochastic matrix, since this is sufficient to ensure information transmission
and low energy cost [3], and is also used in artificial neural networks [13].
However, recent findings on the olfactory system of Drosophila point toward a
more structured connectivity pattern where certain subsets of KCs receive a
different number of connections from PNs [2].

In this work, using a simple model of the olfactory system based on neural 
networks and supervised learning [4,5,8,10], we aim to test the effects of the 
three strategies presented above and the new connectivity proposed, checking 
whether they improve the discrimination capacity of the network. 


2 Methods 


2.1 Model of the Insect Olfactory System 


The model we propose is based on a single hidden layer neural network and
supervised learning and includes the three mechanisms to enhance pattern
discrimination presented in Sect. 1.

A graphic representation of the model and all the details are shown in Fig. 1.
Basically, the input layer of the neural network, X, represents the PNs, while
the hidden layer, Y, represents the KCs and should show sparse activity. The
output layer, Z, represents the MBONs.

Fig. 1. (a) Structure of the biological olfactory system of insects. The ORNs capture
the information of odorants and send it to the AL and from there to the MB, where KCs
use sparse coding to represent it. The MBONs are responsible for the final identification
of stimuli. (b) Computational model used to explore the strategies to improve pattern
recognition. It is an SLFN that uses supervised learning to determine the weights of the
matrix W and the neural thresholds of KCs (θ) in the hidden layer (Y) and MBONs (ε)
in the output layer (Z). The AL is the input (X) to the network and is connected to the
KCs by a binary matrix C. No learning takes place at this layer.

Gain control is introduced through the renormalization of patterns in the
input layer so that the activation of the neurons is uniform for all patterns
[4]. To achieve a heterogeneous neural threshold distribution in the KCs, the
learning algorithm of the network is capable of adjusting the thresholds of each
KC (θ) and MBON (ε) to the values that best fit the classification problem.
Apart from that, the weights W of the connections between KCs and MBONs
are also adjusted, since it is known that associative learning happens in this
layer.
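
As a rough illustration of the gain control step, the sketch below rescales each input pattern so that its total activity is the same regardless of concentration. Dividing by the summed activity is an assumption made here for illustration; the exact renormalization is described in [4], not in this section, and the function name gain_control is hypothetical.

```python
def gain_control(pattern, target_total=1.0):
    """Rescale a non-negative pattern so that its summed activity equals target_total.

    Assumption: dividing by the total activity stands in for the paper's
    renormalization step, whose exact form is not given in this section.
    """
    total = sum(pattern)
    if total == 0:
        return list(pattern)  # a silent pattern has nothing to rescale
    return [target_total * value / total for value in pattern]


weak = [0.1, 0.0, 0.3]
strong = [1.0, 0.0, 3.0]  # same profile at ten times the concentration

normalized_weak = gain_control(weak)
normalized_strong = gain_control(strong)
```

Under this assumption both patterns map to the same normalized representation, which is the concentration invariance that gain control is meant to provide.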

To enable sparse coding in the KC layer, we introduce an activity regulation
term (ART) in the learning rule that allows us to control the level of activity
of the neurons:

ART(Y) = \frac{1}{2} \left( \frac{1}{N_{KC}} \sum_{j=1}^{N_{KC}} y_j - s \right)^2, (1)


where the parameter N_KC is the number of KCs, s ∈ [0.0, 1.0] controls the
level of activity in the KC layer, from no activity when s = 0 to maximum
activity when s = 1, and y_j is the activation of each KC in the network.

Another mechanism to enable sparse coding is the PN-KC ratio, which has
been set to 1:50 [3,11], as in the olfactory system of the locust, to ensure
the fan-out phase between these layers and allow the stimuli to be represented
without overlap [3].

Given all the above, the objective function the network has to minimize in 
order to resolve the classification task is the following: 


E(Z,T,Y) = H(Z,T) + ART(Y), (2)


where Z is the output of the model, Y the activation of KCs, T the target
labels for each pattern, H(Z,T) is the cross-entropy function and ART the
activity regulation term for the KC layer. Further details on the derivation of
the learning rule for neural thresholds will be provided in later publications.
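
The objective in Eq. (2) can be sketched for a single pattern as follows. The cross-entropy is the standard one for a one-hot target; the form of the ART used here (half the squared gap between the mean KC activation and the target level s) is an assumption chosen to be consistent with the description of s and y_j, and may differ from the authors' exact term. All function names are hypothetical.

```python
import math


def cross_entropy(z, t, eps=1e-12):
    """H(Z, T) for one pattern: z holds output probabilities, t a one-hot target."""
    return -sum(t_k * math.log(z_k + eps) for z_k, t_k in zip(z, t))


def activity_regulation(y, s):
    """Assumed ART: half the squared gap between mean KC activation and s."""
    mean_activity = sum(y) / len(y)
    return 0.5 * (mean_activity - s) ** 2


def objective(z, t, y, s):
    """E(Z, T, Y) = H(Z, T) + ART(Y), as in Eq. (2)."""
    return cross_entropy(z, t) + activity_regulation(y, s)


# A KC layer whose mean activity already matches s adds no penalty,
# while a densely active layer is penalized.
sparse_penalty = activity_regulation([0.0, 0.2], s=0.1)
dense_penalty = activity_regulation([1.0, 1.0], s=0.1)
```

With a penalty of this shape, lowering s pushes the hidden layer toward sparser activity without changing the classification term.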


2.2 PN-KC Connectivity 


To implement PN-KC connectivity, we follow the approach used in most insect
olfactory system models, using a stochastic binary matrix C where each
connection between a KC and a PN exists with a probability p_c that can be set
to different values from p_c = 0.0 to p_c = 1.0 [3,5].
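
A minimal sketch of such a stochastic binary matrix is given below: every KC-PN entry is drawn independently and set to 1 with probability p_c. The function name and the layer sizes in the usage lines are illustrative assumptions, not the sizes used in the simulations.

```python
import random


def random_connectivity(n_kc, n_pn, p_c, seed=0):
    """Binary matrix C (n_kc rows, n_pn columns); each entry is 1 with probability p_c."""
    rng = random.Random(seed)
    return [
        [1 if rng.random() < p_c else 0 for _ in range(n_pn)]
        for _ in range(n_kc)
    ]


C = random_connectivity(n_kc=1000, n_pn=20, p_c=0.1)
density = sum(map(sum, C)) / (1000 * 20)  # empirical fraction of existing connections
```

The empirical density fluctuates around p_c, so the number of PNs feeding each KC varies from row to row.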

Alternatively, taking into account the recent findings in Drosophila, we test a
different PN-KC connectivity based on what is described in [2]. KCs are divided
into different subsets depending on how many PNs they are connected to. Hence,
there is a population of one-claw KCs that are connected to a single PN, a
population of two-claw KCs, and so on up to six-claw KCs, the maximum observed
in the biological system [2]. So, in this model, the connection probability p_c
depends entirely on the proportions of each type of KC.
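
The structured topology can be sketched as follows: each KC is assigned a claw count between one and six according to given proportions and is then wired to that many PNs. The proportions below are placeholders (the values from [2] are not reproduced in this section), and choosing distinct PNs uniformly at random is an assumption of this sketch. The helper effective_p_c shows why p_c follows directly from the proportions: it is the mean claw count divided by the number of PNs.

```python
import random


def structured_connectivity(n_kc, n_pn, claw_fractions, seed=0):
    """Binary matrix C where row sums follow the claw distribution.

    claw_fractions[k - 1] is the fraction of KCs with k claws.
    """
    rng = random.Random(seed)
    claw_counts = rng.choices(
        range(1, len(claw_fractions) + 1), weights=claw_fractions, k=n_kc
    )
    C = []
    for claws in claw_counts:
        row = [0] * n_pn
        for pn in rng.sample(range(n_pn), claws):  # distinct PNs per KC (assumption)
            row[pn] = 1
        C.append(row)
    return C


def effective_p_c(claw_fractions, n_pn):
    """Connection probability implied by the claw proportions."""
    total = sum(claw_fractions)
    mean_claws = sum(k * f for k, f in enumerate(claw_fractions, start=1)) / total
    return mean_claws / n_pn


# Hypothetical proportions for one- to six-claw KCs, for illustration only.
HYPOTHETICAL_FRACTIONS = [0.2, 0.3, 0.25, 0.15, 0.07, 0.03]
C_structured = structured_connectivity(500, 50, HYPOTHETICAL_FRACTIONS)
implied_p_c = effective_p_c(HYPOTHETICAL_FRACTIONS, 50)
```

Unlike the stochastic matrix, no KC here is left without input, and none receives more than six connections.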


2.3 Input Patterns 


To test the performance of the model in a pattern classification problem, we
use a reduced version of the well-known MNIST dataset for handwritten digit
recognition [1] that consists of 940 patterns, 209 attributes and 10 different
classes. Some samples of these patterns are shown in Fig. 2 panel (c). We choose
these patterns because they are presented to the network as a one-dimensional
array with different activity regions, similar in complexity to the odorant
patterns the biological system encounters in nature [12].


3 Results 


For simulations, we use the locust olfactory system as reference, where the
PN-KC ratio is 1:50. The size of the neural network is 210 x 10451 x 10. We use
5-fold cross validation and execute 10 simulations of each trial to compute the
average classification error, which we use as the measure of system performance.
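
The evaluation protocol can be sketched as below, under the assumption that "10 simulations of each trial" means repeating the 5-fold split with ten different shuffles and averaging all fold errors. The evaluate callback is a hypothetical stand-in for training the network on the training indices and returning its error on the held-out fold.

```python
import random


def k_fold_indices(n_samples, k, seed):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def mean_cv_error(n_samples, evaluate, k=5, runs=10):
    """Average evaluate(train_idx, test_idx) over `runs` repetitions of k-fold CV."""
    errors = []
    for run in range(runs):
        folds = k_fold_indices(n_samples, k, seed=run)
        for held_out, test_idx in enumerate(folds):
            train_idx = [
                j for f, fold in enumerate(folds) if f != held_out for j in fold
            ]
            errors.append(evaluate(train_idx, test_idx))
    return sum(errors) / len(errors)


def held_out_fraction(train_idx, test_idx):
    """Dummy evaluator: reports the held-out fraction instead of a real error."""
    return len(test_idx) / (len(train_idx) + len(test_idx))


average = mean_cv_error(940, held_out_fraction)
```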


3.1 Level of Sparseness and PN-KC Connection Probability 


We compared the performance of the model including the ART with different
values of s to control the level of activity in KCs with the performance of
the model without ART. The PN-KC connectivity is implemented using the
stochastic matrix approach for biologically plausible values of p_c, between
0.01 and 0.5. Results are shown in Fig. 2. A model including the activity
regulation term outperforms one that lacks it for most of the combinations of s
and p_c values. The model achieves the best results when the sparseness level
in the KC layer is high, which is consistent with what is observed in the
biological system [3,11]. Also, for this sparseness level, the lowest error
rates happen when the connection probability between AL and MB neurons is low,
in the interval [0.01, 0.3]. The result we obtained is within the range of
values considered possible and it is also consistent with energetic efficiency,
but it is still lower than the more generally accepted value, p_c = 0.5
[3,6,11]. In the mean KC activation plot for different values of s and p_c, it
can be seen that when sparseness is high, the KCs show a level of activation
between 10% and 20%. This result is also consistent with the biological facts,
since the level of activity for KCs is very low due to the sparse coding they
use, according to [11].

Fig. 2. (a) Classification error for the handwritten digits dataset for different values
of p_c and low, medium and high sparseness levels. (b) Mean activation level for KCs.
High sparseness values correspond to lower activation in KCs and ensure energetic
efficiency. (c) A sample of the patterns used for classification.

Therefore, for high sparseness levels and low PN-KC connection probabilities,
the network is able to reach an optimal codification, maximizing the transfer
of information and minimizing the energy costs.


3.2 Gain Control 


In order to test the influence of the gain control mechanism on the performance
of the model, we carry out simulations with a high sparseness level, with and
without gain control, for different p_c values. Results are shown in Fig. 3.
When the model works without gain control, its behavior is more stable and its
performance is independent of the p_c value. But the minimum classification
error can only be achieved when gain control is enabled. It seems that gain
control has a positive effect only when p_c is very low, in the range
[0.01, 0.2], reducing the error by 5%, while for greater values of p_c it is
counterproductive. This behavior can be explained by the nature of the patterns
used for classification, as handwritten digits can be classified by the level
of activity they cause in neurons. When p_c is low and only a small number of
connections between PNs and KCs are available, gain control helps to maximize
the transmission of information, while for larger p_c, when almost all the
information is available and transmitted, it makes the level of activity
uniform and therefore eliminates some information that is important for
discrimination. Also, it should be noted that our mechanism for gain control is
fairly limited, so these results should be further tested with more realistic
gain control models.

Fig. 3. Classification error for high sparseness level and different p_c values with and
without gain control. Gain control only has a positive effect for low values of p_c,
improving the performance by 5%. When p_c > 0.2, gain control worsens the performance
achieved by 5-10%, although this effect can be explained by the structure of the
problem patterns.


3.3 Structured PN-KC Connectivity Model


We introduce the structured connectivity model explained in Sect. 2.2, where
KCs are divided into different sets, each of them receiving connections from a
certain number of PNs. There can be six different types of KCs in the system,
from single-claw KCs to six-claw KCs, the maximum observed in the biological
system. The proportions of each type are the same as those found in [2].
Results in [2] show that the new connectivity minimizes redundancy and
optimizes stimulus discrimination. To study how this connectivity pattern
affects the behavior of our neural network, we extrapolate this connectivity
model to the locust, where the size of the network is bigger, by just
maintaining the proportions mentioned before. The value of p_c that corresponds
to this connectivity pattern is p_c = 0.0124 (Fig. 4).

Fig. 4. Classification error using the stochastic matrix connectivity model for different
p_c and using the structured connectivity model with p_c = 0.0124, for locust (1:50
PN-KC ratio) and Drosophila (1:10 PN-KC ratio) configurations.


Simulations with this connectivity model include neither the gain control
mechanism nor the activity regulation term. However, this topology achieves
a classification error similar to the model working with gain control, activity
regulation term and p_c = 0.01. Results suggest that this new topology could
better sample the PN population and maximize the information transmitted to
the KCs while also providing some mechanism for gain control.


4 Conclusions 


In this paper we have introduced an insect olfactory system model based on
neural networks and supervised learning that includes three strategies to improve
the pattern recognition capability of the system. These mechanisms are gain
control, sparse coding and a heterogeneous threshold distribution in KCs. Apart
from adjusting the weights of the connections between KCs and MBONs, the
model is also able to find the distribution of thresholds that best fits a
certain classification problem for KCs and MBONs. To control the level of
activity in the KC layer and allow sparse coding, we introduce a new regulation
term that allows us to choose the activity level.

We carry out simulations for different parameters of the model to study how
these mechanisms influence the performance of the system in a classification task.
We have shown that the system achieves the minimum error when the sparseness
level in the KC layer is high (activity level of 10%) and the PN-KC connection
probability is low. Also, gain control has a positive impact on the performance
of the system, but only for low p_c, due to the structure of the patterns used
for the classification task. However, our mechanism for gain control is not
very realistic, so further investigation must be carried out on this particular
point.

We also tested a new PN-KC connectivity topology proposed in [2], with a
different structure from the classic stochastic matrix approach, and found that
this connectivity can reach the minimum error without making use of gain control
or the activity regulation term. Hence, the behavior and properties that this
new connectivity may introduce could be helpful to understand the processing
of information in the olfactory system and to end controversies about the
values of some of its parameters. The potential properties of this new
connectivity will be further explored in the future.


Abstract. In this paper we propose a conceptual framework for higher-order
artificial neural networks. The idea of higher-order networks arises naturally
when a model is required to learn some group of transformations, every element
of which is well-approximated by a traditional feedforward network. Thus the
group as a whole can be represented as a hyper network. A typical example
of such a group is spatial transformations. We show that the proposed framework,
which we call HyperNets, is able to deal with at least two basic spatial
transformations of images: rotation and affine transformation. We show that
HyperNets are able not only to generalize rotation and affine transformation,
but also to compensate for the rotation of images, bringing them into canonical
forms.


Keywords: Artificial neural networks · Higher-order models ·
Affine transformation · Rotation compensation · Currying neural networks ·
HyperNets


1 Introduction 


Generalization properties of different neural network architectures have been of
interest since the invention of this type of model. Theoretical and empirical studies of
models’ generalization properties remain relevant to the present day [1]. In addition, this
problem has a very special place in the field of computer vision: it is crucial for a
general-purpose computer vision system to learn invariant representations of
sensor inputs [2]. Classic feedforward discriminative architectures, even deep
ones, have been studied fairly thoroughly, and it seems that their generalization
properties are quite restricted, since such models cannot directly transfer the results
of previous learning to very new domains [3]. Moreover, even when working in the same
domain but with great variability in data, these models still give very poor results. A very
instructive example is the inability of a multilayer perceptron to effectively recognize
rotated versions of handwritten digits when trained only on canonical ones [4].
Convolutional neural networks partially address the problem of invariant features by
making assumptions of locality and shared parameters. However, these assumptions are
not yet enough for different types of ConvNets to learn to distinguish between rotated
digits without training on rotated examples [5, 6]. Recently, new models named capsule
networks have been proposed [7], which aim to treat the invariance problem in a
very specific way. Capsules are intended to store additional pieces of information in a
basic neuron structure, which could result in learning non-trivial spatial relationships
between the elements of sensor input on different levels of abstraction. However,
training methods for CapsNets are still not as efficient as traditional versions of gradient
descent due to the intensive process of dynamic routing.

It is also interesting that the generalization properties of traditional models that have
been trained to reconstruct the original or canonical representations of modified inputs
in autoencoder style are also weak, especially if the data domain has changed [8, 9].

Nonetheless, one of the most successful techniques to address the problem of
geometric transform compensation for input images is the use of spatial transforming
layers [10]. Usually such a layer consists of three main parts: a localization net, a grid
generator and a sampler. Such an architecture allows for explicit spatial manipulation of
data within a network for a wide family of parameterized spatial transformations. It has
been shown that models with spatial transforming layers generally have increased
classification accuracy. However, the concept of the ST-layer has several drawbacks: the
necessity to choose only differentiable sampling kernels (e.g., the bilinear), explicit
representation of a parameterized family of transformations, dependency on grid
generation and some domain specificity.

As an alternative, we present a technique that we call hyper-neural networks or
simply HyperNets. This is a method for manipulating model parameters by another
model. A single deep neural network can represent both models. Let us consider
HyperNets in more detail.


2 Main Idea 


Consider a network that accepts an image as input and produces its transformed
version. The network with dense connections between input and output can easily learn to
apply any (but fixed) spatial transformation to an arbitrary image. For example, it can
learn to rotate an image by 45° or flip an image vertically, but the weights of connections
will be different for each individual transformation.

Imagine we want the network to learn how to rotate an image by an arbitrary angle
provided as an input, without hard-coding a special (non-neural) procedure for spatially
transforming images. If we add traditional neurons accepting the parameters of the
transformation as input in addition to the image, the network will just mix the image
content with these parameters. Even making the network deeper and appending the
transformation parameters to its latent code does not help the network to learn how to
transform images independently of their content, as we shall see later.

Thus, if we have an input image x ∈ X and transformation parameters φ ∈ Φ, it is
convenient to represent the transformation process as a mapping X × Φ → X:


x' = f(x \mid \varphi), (1)


where x' denotes the transformed image. Given a labeled training set, a traditional
model tries to learn an approximation g to the function f:


f(x \mid \varphi) \approx g_\theta(x, \varphi), (2)


where θ denotes the adjustable parameters of the model. In such an approach φ is usually
treated as an additional vector of input values that could be connected to an arbitrary
layer of the model represented by a deep network (Fig. 1). As we shall see later, in this
case the model will not have sufficient generalization properties.

Instead we can do some form of currying for the function g. As a result we obtain a
new function curry(g):


curry(g) = h_\omega(\varphi) = (\varphi \mapsto g_\theta(x, \varphi)), (3)


which should also have trainable parameters ω. So now we can directly search for the
mapping h: Φ → (X → X).
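
Currying itself can be illustrated outside of neural networks. In the sketch below, a toy two-argument function g(x, φ), here a cyclic shift standing in for an image transformation, is turned into a function h that takes φ alone and returns a one-argument transformation of x. The toy function and all names are illustrative assumptions, not part of the HyperNet model.

```python
def g(x, phi):
    """Toy two-argument 'model': cyclically shift the list x by phi positions."""
    shift = phi % len(x)
    return x[-shift:] + x[:-shift] if shift else list(x)


def curry(fn):
    """Turn fn(x, phi) into h, where h(phi) is itself a function of x."""
    def h(phi):
        def transform(x):
            return fn(x, phi)
        return transform
    return h


h = curry(g)
shift_by_two = h(2)                      # a fixed transformation X -> X
result = shift_by_two([1, 2, 3, 4, 5])   # [4, 5, 1, 2, 3]
```

In the HyperNet setting the role of h is played by a trainable network, and what it returns is not a closure but the parameters of the transforming network.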

Since it is easy for networks to learn how to transform images by an individual 
transformation, but their trained weights depend on the parameters of this transfor- 
mation, it seems quite natural to introduce control neurons, which take these parameters 
as input and modulate the connection weights of the controlled network through 
higher-order connections. 

The core idea behind HyperNets is the representation of neural networks as
higher-order functions, which implies a very special network architecture where the
function h_ω(φ) is a neural network too (Fig. 2). This means that the parameters θ of the
network g are described as outputs of the network h_ω(φ): θ(φ) = h_ω(φ). Thus, we try to
approximate the target function f by the following model:


f(x \mid \varphi) \approx g_{h_\omega(\varphi)}(x). (4)


In the case of complex models g, especially deep ones, not all of the parameters θ
need to depend on the transformation parameters φ. Hence, we can rewrite our expression
for a higher-order model using a slightly redundant notation:


f(x \mid \varphi) \approx g_{h_\omega(\varphi), \theta'}(x), (5)


where θ' denotes the parameters of the model g that are not affected by higher-order
terms, i.e. θ = {h_ω(φ), θ'}. Again, having the training set D comprising m pairs of
canonical and transformed images along with the respective transformation parameters,
D = {x_i, x'_i, φ_i}_{i=1}^{m}, the goal of the model is to learn the transformation
concept h_ω(φ) and its properties by minimizing some error/loss function and, thus,
finding optimal values for θ' and ω:


\omega^*, \theta'^* = \arg\min_{\omega, \theta'} \left( \sum_{i=1}^{m} L\left(x'_i, z_i = g_{h_\omega(\varphi_i), \theta'}(x_i)\right) \right), (6)


where L(x', z) is the corresponding loss function between the model’s output and the
target image.

The name ‘hyper network’ comes from the analogy between the higher-order 
functions represented by neural nets and hypergraphs, which could be considered as an 
extension of traditional computational graph approach. 

In this work we have considered relatively simple HyperNet architectures. The 
interaction between parameters of the higher-order part and the ‘core’ part of the model 
presented in Fig. 2 can be described as follows: 


W_{hi-ord} = \mathrm{softmax}(a(\varphi)), (7)
z = (W_{hi-ord} \odot W)\, x, (8)


where a is the activation of the last layer of the higher-order part of the network, W
denotes the parameters (weights) of the ‘core’ part of the model and ⊙ denotes the
element-wise product between matrices.

Fig. 1. Regular autoencoder with control (sin and cos of desired angle)

Fig. 2. Simple higher-order model architecture


It can be seen that the model, whilst being structurally complex, remains
differentiable, which allows us to directly apply standard optimization techniques under
various computational graph frameworks and to simultaneously train both the higher-order
and ‘core’ parts of the network. It is also worth saying that the transformation parameters φ
could be represented in numerous ways. For example, in the case of planar image rotation,
φ might be parameterized by one angle α with activation constraints for the next hidden
layer of the network, or by two values representing sin(α) and cos(α). So the input to the
higher-order part of the network will be the transformation parameter, and the input for
the ‘core’ part of the network will be the image. For example, if you want to train a
model to rotate an image, you may use cos and sin of an angle as parameters and the
unrotated image as an input. The rotated image will be the desired output. Thus, the
model is trained in a supervised style.
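
The interaction in Eqs. (7) and (8) can be sketched in a few lines. Two details are assumptions of this sketch rather than statements from the text: the higher-order part is reduced to a single linear map a(φ), with one trainable vector per core weight, and the softmax is taken over each row of the resulting matrix. All names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def higher_order_weights(phi, A):
    """Eq. (7): W_hi-ord = softmax(a(phi)).

    Assumptions: a(phi)[i][j] is the dot product of phi with a trainable
    vector A[i][j], and the softmax runs over each row of that matrix.
    """
    return [
        softmax([sum(a * p for a, p in zip(A_ij, phi)) for A_ij in A_row])
        for A_row in A
    ]


def hypernet_forward(x, phi, W, A):
    """Eq. (8): z = (W_hi-ord * W) x, with * the element-wise product."""
    W_hi = higher_order_weights(phi, A)
    return [
        sum(w_hi * w * x_j for w_hi, w, x_j in zip(W_hi_row, W_row, x))
        for W_hi_row, W_row in zip(W_hi, W)
    ]


angle = math.radians(30.0)
phi = [math.sin(angle), math.cos(angle)]                 # control input
x = [1.0, 2.0, 3.0, 4.0]                                 # flattened toy "image"
W = [[1.0] * 4 for _ in range(3)]                        # core weights, 3 outputs
A = [[[0.0, 0.0] for _ in range(4)] for _ in range(3)]   # untrained higher-order part
z = hypernet_forward(x, phi, W, A)
```

Because every operation here is differentiable, both W and A could be trained jointly by gradient descent on the loss in Eq. (6).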


3 Experiments and Results 


We have performed several types of experiments using the developed HyperNet. We
have also tried many architecture modifications, such as adding more dense layers to the
‘core’ network, using convolutional layers, etc.


3.1 Rotation Experiment 


The first experiment is a simple rotation generalization. We have already discussed
what the input and output are in this scenario. In this experiment we have used
a simple HyperNet, a deep HyperNet and a deep convolutional autoencoder (AE). Besides
using different kinds of models, we have tried to learn angles in two different ways: using
all angles in [0, 360] during both training and testing, and using discrete angles (0, 45,
90, 135, ..., 360°) during training but all angles during testing (interpolation experiment).
Below we present only the results for discrete-angle learning for the HyperNet and
all-angle learning for the AE. Also, both experiments (with AE and HyperNet) included an
extrapolation part for the digits 4 and 9. For these two digits the training angles were
taken not from [0, 360] but from the [0, 90] range of degrees. The results of these
experiments are shown in Fig. 3.
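
The two angle regimes can be sketched as follows: control inputs are the (sin, cos) pair of the angle, training draws only multiples of 45 degrees, and testing draws any angle in [0, 360]. The helper names and the use of a seeded generator are assumptions for illustration.

```python
import math
import random


def angle_to_control(angle_deg):
    """Encode a rotation angle as the (sin, cos) pair fed to the higher-order part."""
    a = math.radians(angle_deg)
    return (math.sin(a), math.cos(a))


# Training uses only multiples of 45 degrees; testing draws any angle in [0, 360].
TRAIN_ANGLES = list(range(0, 361, 45))


def sample_angle(rng, discrete):
    """Pick a training angle (discrete) or a test angle (continuous)."""
    return rng.choice(TRAIN_ANGLES) if discrete else rng.uniform(0.0, 360.0)


rng = random.Random(0)
train_controls = [angle_to_control(sample_angle(rng, discrete=True)) for _ in range(5)]
test_controls = [angle_to_control(sample_angle(rng, discrete=False)) for _ in range(5)]
```

One convenient property of the (sin, cos) encoding is that 0 and 360 degrees map to the same control vector, so the representation has no jump at the wrap-around point.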




Fig. 3. Test results of simple HyperNet trained on discrete angles. Digits 1 and 4 


In the figure above, the odd columns contain ground-truth rotated images and the even
columns contain images rotated by a simple HyperNet. During the testing phase only digits
that had not been present in the training set were used. Again, the sin and cos of the
desired angle were used as the input to the higher-order part of the network. Moreover, we
have tested our model, previously trained solely on digits, on rotating letters. The results
are shown in Fig. 4 (left). As can be seen, after adding the higher-order weights, a simple
model with just one input layer and one output layer could be trained to generalize
rotation. However, there are some artifacts present in the images, especially for the digit 1.
Hence, a simple model was unable to interpolate rotation when trained on discrete angles.


Fig. 4. Test results of simple (left) and deep convolutional (right) HyperNet applied to letters. 
Letters have not been used for training 


A slightly deeper model with two convolutional, two dense and two deconvolutional
layers (higher-order weights are applied to the weights between the two dense layers)
returns better results, as can be seen in Fig. 5. These results were obtained by
training on discrete angles, and it can be seen that the deep higher-order network can
already interpolate rotation. The right part of Fig. 4 shows the results of testing the deep
HyperNet on letters.

Fig. 5. Test results of deep HyperNet trained on discrete angles. Digits 1 and 4


We have also compared the reconstruction loss between the HyperNet and a baseline
convolutional autoencoder (AE), which had been designed to learn the rotation transform
using information about the sin and cos of the desired angle added to the latent code
(see Fig. 7). The ten graphs correspond to the ten digits. Also, Fig. 6 presents the results
of AE rotation generalization. It is worth noting, though, that the AE was trained on all
angles, not only discrete ones. It should also be mentioned that the HyperNet was able to
extrapolate the rotation representation for 4 and 9, while the regular AE with control
could not.

Fig. 7. Reconstruction loss comparison of HyperNet models with baseline autoencoder

3.2 Affine Transformation Experiment 


Of course, rotation generalization is not such an interesting task. Affine transformation
generalization is more challenging. In this case, we have used six affine transformation
parameters as the higher-order input and the canonical image as the input for the core
part of the network. The transformed image was used as the desired output. It is worth
noting, though, that the affine transformation parameters were limited in their range to
ensure that the digit is still present in the 28 x 28 image and is recognizable to a
human. The results are shown in Fig. 8. In this experiment we present only the
deep model (with convolutional layers), since the simple model shows worse results
(though still decent).
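
For reference, the ground-truth transformation that the network is asked to imitate can be sketched as a plain six-parameter affine warp. The parameter ordering (a, b, tx, c, d, ty), the pixel-coordinate origin at the top-left corner and the nearest-neighbour sampling are assumptions of this sketch; the text does not specify how the target images were generated.

```python
def affine_warp(image, params):
    """Warp a 2-D list `image` with x' = a*x + b*y + tx, y' = c*x + d*y + ty.

    Uses inverse mapping with nearest-neighbour sampling; pixels that map
    from outside the source image are left at 0.
    """
    a, b, tx, c, d, ty = params
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("affine matrix is not invertible")
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y_out in range(height):
        for x_out in range(width):
            # Invert the 2x2 linear part to find the source pixel.
            x_src = (d * (x_out - tx) - b * (y_out - ty)) / det
            y_src = (-c * (x_out - tx) + a * (y_out - ty)) / det
            xi, yi = round(x_src), round(y_src)
            if 0 <= xi < width and 0 <= yi < height:
                out[y_out][x_out] = image[yi][xi]
    return out


image = [[1, 0, 0],
         [0, 0, 0],
         [0, 0, 0]]
identity = affine_warp(image, (1, 0, 0, 0, 1, 0))
shifted = affine_warp(image, (1, 0, 1, 0, 1, 0))  # translate one pixel to the right
```

Restricting the range of the six parameters, as described above, keeps most of the digit inside the frame after such a warp.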

As can be seen, the convolutional HyperNet was able to learn almost random affine
transformations and to apply them to digits that were not contained in the training set.
Though a simple model can still generalize affine transformation thanks to the
higher-order part of the network, the deeper network shows much smoother results. The
ability of the AE model to learn affine transformation was also tested, and the results are
shown in Fig. 9.

Fig. 9. Results of AE model for the training set (left) and test set (right)


The AE model shows decent results; however, some blurring and artifacts are
present in these pictures. The comparison between AE and HyperNet for the affine
transformation generalization experiment is presented in Table 1.


Table 1. Comparison between AE and higher-order model 


                    | Simple HyperNet | Deep HyperNet | Autoencoder
Reconstruction loss | 0.049395        | 0.0138223     | 0.0606265


3.3 Rotation Compensation Experiment 


In the previous experiments we trained a model to transform or simply rotate
an input image using control parameters and a higher-order architecture. In this last part
we were interested in compensating rotation without any knowledge of the angle. This
means that the control parameters in this scenario are not sin and cos, since this would
be an inverse problem, which is less interesting and challenging. But what could then be
used as an input to the HyperNet? We have tried using the rotated image as the input to
the core network and also as the input to the higher-order network, with the
canonical image as the desired output. The idea is that the higher-order part of the network
could possibly extract the parameters of the transformation from the rotated image by
itself. In this case the dynamics of the higher-order part of the network can be described
as follows:


W_{hi-ord} = \mathrm{softmax}(a(x)). (9)


However, we had to slightly deepen the higher-order part of the network to ensure that it
could do this. So, in this experiment, the higher-order part of the network
consists of two convolutional layers and one dense layer. Some results are shown in
Fig. 10. Again, only the results of the convolutional HyperNet are shown, since it
performed better in the previous experiments.

Fig. 10. Results of rotation compensation using deep HyperNet. Test set. Digits 6, 1 and 9


In the figure above, the odd columns are the rotated input images and the even
columns are the canonical images produced by the network. The most
interesting results are for the digits 6 and 9, since, when rotated by 180 degrees, a 6
actually becomes a 9. So one could expect that 6 and 9 would be confused by the network.
However, the HyperNet was able to correctly compensate rotated digits,
including 6 and 9. There are some artifacts in the images, but overall the quality is
good. Figure 12 presents the results of the AE rotation compensation experiment, and
Fig. 11 shows the comparison graphs between these two models. As a reminder, the
models were trained on digits 4 and 9 that had been rotated only in the [0, 90] range of
degrees. That explains the difference in the graphs for these two digits for the AE model.

Fig. 11. Comparison of HyperNet and autoencoder applied for rotation compensation

Fig. 12. Rotation compensation results using AE model. Test set. Digits 6, 1 and 9


4 Conclusion 


In this article we have proposed a new approach to artificial neural networks based on
generating a network's parameters by higher-order modules that are themselves
networks. In other words, the output of the higher-order part acts as a weight
matrix for the core part of the network. It has been shown that even a simple HyperNet
with just one input layer and one output layer in its core part can generalize rotation and
affine transformation. Adding convolutional layers yields smoother results.
Moreover, a deep HyperNet can compensate rotation without any information
about the angle. In future work, this approach could be used to compensate
other types of transformations or be extended to generative models.

Our code is available on GitHub at
https://github.com/singnet/semantic-vision/tree/master/experiments/invariance/hypernets.
Abstract. We investigate the performance of DNNs when trained on
class-incremental visual problems consisting of initial training, followed
by retraining with added visual classes. Catastrophic forgetting (CF)
behavior is measured using a new evaluation procedure that aims at
an application-oriented view of incremental learning. In particular, it
requires that model selection be performed on the initial dataset
alone, and that retraining control be performed using only
the retraining dataset, as the initial dataset is usually too large to be
kept. Experiments are conducted on class-incremental problems derived
from MNIST, using a variety of different DNN models, some of them
recently proposed to avoid catastrophic forgetting. When comparing our
new evaluation procedure to previous approaches for assessing CF, we
find that their findings are completely negated, and that none of the tested
methods can avoid CF in all experiments. This stresses the importance of
a realistic empirical measurement procedure for catastrophic forgetting,
and the need for further research in incremental learning for DNNs.


Keywords: DNN · Catastrophic forgetting · Incremental learning


1 Introduction 


The context of this article is the susceptibility of DNNs to an effect usually termed
“catastrophic forgetting” or “catastrophic interference” [2]. When training a
DNN incrementally, that is, first training it on a sub-task D1 and subsequently
retraining it on another sub-task D2 whose statistics differ (see Fig. 1), CF implies
an abrupt and virtually complete loss of knowledge about D1 during retraining.
In various forms, knowledge of this effect dates back to very early works on
neural networks [2], of which modern DNNs are a special case. Nevertheless,
known solutions seem difficult to apply to modern DNNs trained in a purely
gradient-based fashion. Recently, several approaches have been published with
the explicit goal of resolving the CF issue for DNNs in incremental learning tasks,
see [3,5,10]. On the other hand, there are “shallow” machine learning
methods explicitly constructed to avoid CF (reviewed in, e.g., [9]), although
© Springer Nature Switzerland AG 2018 


V. Kurkova et al. (Eds.): ICANN 2018, LNCS 11139, pp. 487-497, 2018. 
https://doi.org/10.1007/978-3-030-01418-6_48


488 B. Pfülb et al. 


[Figure 1. Panels: (a) Training scheme (sub-task D1 followed by sub-task D2), (b) without CF, (c) with CF. Horizontal axes: iterations. Curves: train:D1,test:D1; train:D2,test:D2; train:D2,test:All.]


Fig. 1. Scheme of incremental training experiments (see (a)) and representative
outcomes without and with CF (see (b) and (c)). Initial training with sub-task D1 for t_max
iterations is followed by retraining on sub-task D2 for another t_max iterations.
During training (white background) and retraining (grey background), test performance
is measured on D1 (blue curves), D2 (green curves) and D1 ∪ D2 (red curves). The
red curves make it possible to determine the presence of CF by simple visual inspection: if there
is significant degradation w.r.t. the blue curves, then CF has occurred. (Color figure
online)


this ability seems to be achieved at the cost of significantly reduced learning
capacity. In this article, we test the recently proposed solutions for DNNs using
a variety of class-incremental visual problems constructed from the well-known
MNIST benchmark [6]. In particular, we propose a new experimental protocol for
measuring CF which avoids implicit assumptions that are commonly made [3,5,7,10]
but are incompatible with incremental learning in applied scenarios.


1.1 Application Relevance of Catastrophic Forgetting 


When DNNs are trained on a single (sub-)task D1 only, catastrophic forgetting
is not an issue. When retraining with a new sub-task D2 becomes necessary, one
often resorts to retraining the DNN with all samples from D1 and D2 together.
This heuristic works in many situations, especially when the cardinality of D1
is moderate. When D1 becomes very large, however, or when many slight additions
D2, ..., Dn are required, this strategy becomes infeasible, and an incremental
training scheme (see Fig. 1a) must be used. The issue of catastrophic forgetting
then becomes critically important, which is why we wish to assess
where DNNs stand with respect to CF.


1.2 Approach of the Article 


In all experiments, we consider class-incremental learning scenarios divided into
two training steps on disjoint sub-tasks D1 and D2, as outlined in Sect. 1 and
visualized in Fig. 1. Both training steps are conducted for a fixed number of
iterations, with the understanding that in practice retraining would have to be
stopped at some point by an appropriate criterion before forgetting of D1 is
complete. The occurrence of forgetting is quantified using classification performance
on all test samples from D1 ∪ D2 at the time retraining is stopped (see Fig. 1 for
a visual impression). In contrast to previous works, our experiments take into
account how (class-)incremental learning works in practice:
Table 1. Overview of the 6 DNN models used in this study. They are obtained by
combining the concept of Dropout (D) with the basic DNN models: fully-connected
(fc), convolutional (conv), LWTA and EWC.


Concept         | Model
                | fc   | conv   | LWTA           | EWC
With Dropout    | D-fc | D-conv | X              | D-EWC (EWC)
Without Dropout | fc   | conv   | LWTA-fc (LWTA) | X


— D2 is not available at initial training time.
— D1 is not available at retraining time, as it might be very large.


This training paradigm (which we term “realistic”) has profound consequences,
most importantly that initial model selection has to be performed using
D1 alone. This is in contrast to previous works on CF in DNNs [3,5,10], where
D1 ∪ D2 is used for model selection purposes. Another consequence is that the
decision on when to stop retraining has to be taken based on D2 alone.

In order to reproduce earlier results, we introduce another training paradigm,
which we term “prescient”, where both D1 and D2 are known at all times, and
which aligns well with evaluation methods in recent works. As classifiers, we use
typical DNN models: fully-connected (fc), convolutional (conv), LWTA-based
(fc-LWTA) and DNNs based on the EWC model (EWC). Most of these
can be combined with the concept of Dropout (D, [4]). An overview of possible
combinations is given in Table 1.

For all models, hyperparameter optimization is conducted in order to ensure
that our results are not simply accidental.


1.3 Related Work on CF in DNNs 


In addition to early works on CF in connectionist models [2], new approaches
specific to DNNs have recently been proposed, some with the explicit goal of
preventing catastrophic forgetting [3,5,7,10]. The work presented in [3] advocates
the popular Dropout method as a means to reduce or eliminate CF, validating
this claim on tasks derived from a randomly shuffled version of MNIST [6] and a
Sentiment Analysis problem. In [10], a new kind of competitive transfer function
is presented which is termed LWTA (Local Winner Takes All). In a very
recent article [5], the authors advocate determining the hidden layer weights that
are most “relevant” to a DNN's performance, and penalizing changes to those
weights more heavily during retraining through an additional term in the energy
functional. Experiments are conducted on random data, randomly shuffled MNIST
data as in [3,10], and on a task derived from Deep Q-learning in Atari Games
[8]. Even more recently, the authors of [7] propose the so-called incremental moment
matching (IMM) technique, which suggests an alignment of statistical properties
of the DNN between D1 and D2. IMM is not included here because it inherently
requires knowledge of D1 at retraining time to select the best regularization
parameter(s).


2 Methods 


The principal dataset this investigation is based on is MNIST [6]. Despite being a
very old and very simple benchmark, it is still widely used, in particular in
recent works on incremental learning in DNNs [3,5,7,10]. It is used here because
we wish to reproduce these results, and also because we care about performance
in class-incremental settings, not offline performance on the whole dataset. As
we will see, MNIST-derived problems are more than a sufficient challenge for
the tested algorithms, so it is unnecessary to add more complex ones (but
see Sect. 4 for a more in-depth discussion of this issue).


2.1 Learning Tasks 


As outlined in Sect. 1.2, incremental learning performance of a given model is
evaluated on several datasets constructed from the MNIST dataset. The model is
trained successively on two sub-tasks (D1 and D2) from the chosen dataset, and
we record to what extent knowledge about the previous sub-task is retained.
The precise way in which the sub-tasks of all datasets are constructed from the MNIST
dataset is described below.


Exclusion: D5-5. These datasets are obtained by randomly choosing 5 MNIST
classes for D1, and the remaining 5 for D2. To verify that results do not depend
on a particular choice of classes, we create a total of 8 datasets where the
partitioning of classes is different (see Table 2).


Exclusion: D9-1. We construct these datasets in a similar way as D5-5,
selecting 9 MNIST classes for D1 and the remaining class for D2. In order to make
sure that no artifacts are introduced, we create three datasets (D9-1a, D9-1b
and D9-1c) with different choices for D1 and D2, see Table 2.


Permutation: DP10-10. This is the dataset used to evaluate incremental
retraining in [3,5,10], so results can directly be compared. It contains two
sub-tasks, each of which is obtained by permuting each 28 × 28 image in a random
fashion that is different between, but identical within, sub-tasks. Since both
sub-tasks contain 10 MNIST classes, we denote this dataset by DP10-10, the “P”
indicating permutation, see Table 2.
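
The construction of these sub-tasks can be illustrated with a short sketch. It is not the authors' code: the helper names `split_by_classes` and `permute_pixels` are hypothetical, and a small synthetic array stands in for MNIST so that the example runs on its own.

```python
import numpy as np

def split_by_classes(images, labels, d1_classes):
    """Split a labelled dataset into sub-tasks D1 and D2 by class membership."""
    in_d1 = np.isin(labels, d1_classes)
    return (images[in_d1], labels[in_d1]), (images[~in_d1], labels[~in_d1])

def permute_pixels(images, rng):
    """Apply one fixed random pixel permutation to every image of a sub-task."""
    flat = images.reshape(len(images), -1)
    perm = rng.permutation(flat.shape[1])
    return flat[:, perm], perm

# Synthetic stand-in for MNIST (assumption: real data would be loaded instead).
rng = np.random.default_rng(0)
images = rng.random((200, 28, 28))
labels = rng.integers(0, 10, size=200)

# Exclusion task, e.g. D5-5a from Table 2: classes 0-4 form D1, classes 5-9 form D2.
(d1_x, d1_y), (d2_x, d2_y) = split_by_classes(images, labels, [0, 1, 2, 3, 4])

# Permutation task DP10-10: both sub-tasks keep all 10 classes, but each
# sub-task uses its own permutation, shared by all images within it.
p1_x, perm1 = permute_pixels(images, rng)
p2_x, perm2 = permute_pixels(images, rng)
```

A D9-1 split follows from the same helper by passing nine class labels as `d1_classes`.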


2.2 Models 


We use TensorFlow/Python to implement or re-create all models used in this
article. The source code for all experiments is available at
https://gitlab.informatik.hs-fulda.de/ML-Projects/CF_in_DNNs.
Table 2. MNIST-derived datasets (DS) used in this article. All partitions of MNIST
into D1 and D2 are non-overlapping. For the DP10-10 dataset, the classes are identical
for D1 and D2 but pixels are permuted in D2 as described in the text.


DS         | D5-5a | D5-5b | D5-5c | D5-5d | D5-5e | D5-5f | D5-5g | D5-5h | D9-1a | D9-1b | D9-1c  | DP10-10
D1 classes | 0-4   | 02468 | 34689 | 02567 | 01345 | 03489 | 05678 | 02368 | 0-8   | 1-9   | 0, 2-9 | 0-9
D2 classes | 5-9   | 13579 | 01257 | 13489 | 26789 | 12567 | 12349 | 14579 | 9     | 0     | 1      | 0-9

Fully Connected Deep Network. Here, we consider a “normal” fully-connected
(FC) feed-forward MLP with two hidden layers, a softmax (SM) readout
layer trained using cross-entropy, and the (optional) application of Dropout
(D) and ReLU operations after each hidden layer. Its structure can thus be
summarized as In-FC1-D-ReLU-FC2-D-ReLU-FC3-SM. In case more hidden layers
are added, their structure is analogous.
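
As an illustration of the In-FC1-D-ReLU-FC2-D-ReLU-FC3-SM structure, the following NumPy sketch computes one forward pass. It is only a sketch under assumptions: the function names, the weight initialization and the use of inverted Dropout scaling are choices made for this example, not details taken from the article, whose models are implemented in TensorFlow.

```python
import numpy as np

def init_params(sizes, rng):
    """Create (weights, bias) pairs for consecutive layer sizes (illustrative init)."""
    return [(rng.normal(0.0, 0.05, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def dropout(x, rate, rng, training):
    """Zero a random subset of activities during training (inverted scaling assumed)."""
    if not training or rate == 0.0:
        return x
    keep = rng.random(x.shape) >= rate
    return x * keep / (1.0 - rate)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fc_forward(x, params, rng, training=False, rate=0.5):
    """In-FC1-D-ReLU-FC2-D-ReLU-FC3-SM forward pass."""
    (w1, b1), (w2, b2), (w3, b3) = params
    h = np.maximum(dropout(x @ w1 + b1, rate, rng, training), 0.0)
    h = np.maximum(dropout(h @ w2 + b2, rate, rng, training), 0.0)
    return softmax(h @ w3 + b3)

rng = np.random.default_rng(0)
params = init_params([784, 200, 200, 10], rng)
probs = fc_forward(rng.random((4, 784)), params, rng, training=True)
```

Each row of `probs` is a probability distribution over the 10 classes; calling `fc_forward` with `training=False` disables Dropout, as would be done at test time.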


ConvNet. A convolutional network inspired by [1] is used here, with two hidden
layers and the application of Dropout (D), max-pooling (MP) and ReLU after
each layer, as well as a softmax (SM) readout layer trained using cross-entropy.
Its structure can thus be stated as In-C1-MP-D-ReLU-C2-MP-D-ReLU-FC3-SM.


EWC. The Elastic Weight Consolidation (EWC) model has recently been
proposed in [5] to address the issue of CF in incremental learning tasks. We use
a TensorFlow implementation provided by the authors that we integrate into
our own experimental setup; the corresponding code is available for download
as described. The basic network structure is analogous to that of fc models.
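
For orientation, the additional penalty term used by EWC can be sketched as follows. This follows the quadratic form commonly given for EWC, in which each weight's deviation from its value after initial training is weighted by its Fisher information; it is a stand-alone NumPy illustration with hypothetical names, not the implementation provided by the authors of [5].

```python
import numpy as np

def ewc_loss(task_loss, params, anchor_params, fisher, lam):
    """Loss on the new sub-task plus the EWC penalty.

    params        -- current weights (flattened)
    anchor_params -- weights stored after initial training on D1
    fisher        -- per-weight Fisher information estimated on D1
    lam           -- importance parameter (lambda)
    """
    penalty = 0.5 * lam * np.sum(fisher * (params - anchor_params) ** 2)
    return task_loss + penalty

# Toy values: only the first weight has moved away from its anchor.
loss = ewc_loss(task_loss=1.0,
                params=np.array([1.0, 2.0]),
                anchor_params=np.array([0.0, 2.0]),
                fisher=np.array([2.0, 5.0]),
                lam=10.0)
```

Weights with large Fisher information are thus expensive to change during retraining, while weights with low Fisher information remain free to adapt to D2.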


LWTA. Deep learning with a fully-connected Locally-Winner-Takes-All
(LWTA) transfer function has been proposed in [10], where it is also suggested
that deep LWTA networks are significantly robust when trained incrementally
with several tasks. We use a self-coded TensorFlow implementation of the
model proposed in [10]. Following [10], the number of LWTA blocks is always set
to 2. The basic network structure is analogous to that of fully-connected models.
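
The idea of a local winner-takes-all transfer function can be sketched in a few lines of NumPy. The sketch assumes that the units of a layer are grouped into consecutive blocks and that only the largest activation in each block is passed on; the parameter `block_size` is hypothetical, and how it relates to the block setting of 2 used in this article is an assumption.

```python
import numpy as np

def lwta(activations, block_size=2):
    """Keep the largest activation in each block of units and zero the rest."""
    n_samples, n_units = activations.shape
    blocks = activations.reshape(n_samples, n_units // block_size, block_size)
    winners = blocks == blocks.max(axis=2, keepdims=True)  # ties keep all maxima
    return (blocks * winners).reshape(n_samples, n_units)

out = lwta(np.array([[1.0, 3.0, -2.0, -1.0]]))
```

In this example the blocks are (1, 3) and (-2, -1), so the activations 3 and -1 survive and the other two units are set to zero.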


Dropout. Dropout, introduced in [4] and widely used in recent research on 
DNNs, is a special transfer function that sets a random subset of activities in 
each layer to 0 during training. It can, in principle, be applied to any DNN and 
thus can be combined with all previously listed models except EWC (already 
incorporated) and LWTA (unclear whether this would be sensible as LWTA is 
already a kind of transfer function). 


2.3 Experimental Procedure 


The procedure we employ for all experiments is essentially the one given in
Sect. 1.2, where all models listed in Sect. 2.2 and Table 1 are applied to a subset
of class-incremental learning tasks described in Sect. 2.1. For each experiment,
characterized by a pair of model and task, we conduct a search in model parameter
space for the best model configuration, leading to multiple runs per experiment,
each run corresponding to a particular set of parameters for a given model
and a given task.

Each run lasts for $2 t_{\max}$ iterations and is structured as shown in Fig. 1: the
chosen model is first trained on sub-task D1 and subsequently on sub-task D2,
each time for $t_{\max}$ iterations. Classification accuracy, measured at iteration
t on a test set B while training on a train set A, is denoted $\chi(A, B, t)$. For a
thorough evaluation, we record the quantities $\chi(D_1, D_1, t < t_{\max})$,
$\chi(D_2, D_2, t > t_{\max})$ and $\chi(D_2, D_1 \cup D_2, t > t_{\max})$. Finally, the
best-suited parameterized model must be chosen among all the runs of an experiment.
We investigate two strategies for doing this, corresponding to different levels of
knowledge at training and retraining time during a single run. As detailed in Sect. 1.2,
these are the strategies which we term “prescient” and “realistic”. The “prescient”
evaluation strategy (see Algorithm 1) corresponds to a priori knowledge of sub-task D2
at initial training time, as well as to knowledge about D1 at retraining time.
Both assumptions are difficult to reconcile with incremental training in applied
scenarios, as detailed in Sect. 1.2. We use this strategy here to compare our
results to previous works in the field [3,5,10]. In contrast, the “realistic”
evaluation strategy (see Algorithm 2) assumes no knowledge about future sub-tasks
(D2) and furthermore supposes that D1 is unavailable at retraining time due to
its size (see Sect. 1.2 for the reasoning). It is this strategy which we propose for
future investigations concerning incremental learning.


2.4 Hyperparameters and Model Selection 


For all runs not involving CNNs, the following parameters are varied:
the number of hidden layers $L \in \{2, 3\}$, the layer sizes $S \in \{200, 400, 800\}$,
the learning rate during initial training $\epsilon_{D_1} \in \{0.01, 0.001\}$, and the
learning rate during retraining $\epsilon_{D_2} \in \{0.001, 0.0001, 0.00001\}$. All models
are evaluated on the parameter set $P \subset L \times S \times \epsilon_{D_1} \times \epsilon_{D_2}$,
with model-specific hyperparameters added or substituted where applicable.
For experiments using CNNs, we fix the topology
to a form known to achieve good performance on MNIST, as an exhaustive
optimization of all relevant parameters would prove too time-consuming in this
case, and vary only $\epsilon_{D_1}$ and $\epsilon_{D_2}$ as detailed before. For EWC experiments,
the importance parameter $\lambda$ of the retraining run is fixed at $1/\epsilon_{D_2}$; this choice
is not documented in [5] but is used in the provided code, which is why we
adopt it. For LWTA experiments, the number of LWTA blocks is fixed to 2 in all
experiments, corresponding to the values used in [10]. Dropout rates, if applied,
are set to 0.2 (input layer) and 0.5 (hidden layers), consistent with the choices
made in [3]. For CNNs, only a single Dropout rate of 0.5 is applied for input and
hidden layers alike. The length $t_{\max}$ of the training/retraining period is empirically
fixed to 2500 iterations, each iteration using a batch size of 100 (batch_size). The
Momentum optimizer provided by TensorFlow is used for performing training,
with a momentum parameter p = 0.99.
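
The size of the search space for the non-convolutional models follows directly from these value sets. The sketch below enumerates it with `itertools.product`; the dictionary keys are hypothetical names chosen for this example, and the `ewc_lambda` entry simply applies the stated rule of fixing the EWC importance parameter to the inverse of the retraining learning rate.

```python
from itertools import product

HIDDEN_LAYERS = [2, 3]
LAYER_SIZES = [200, 400, 800]
LR_INITIAL = [0.01, 0.001]
LR_RETRAIN = [0.001, 0.0001, 0.00001]

def parameter_grid():
    """Yield one configuration per combination of the varied hyperparameters."""
    for n_layers, size, lr1, lr2 in product(
            HIDDEN_LAYERS, LAYER_SIZES, LR_INITIAL, LR_RETRAIN):
        yield {
            "hidden_layers": n_layers,
            "layer_size": size,
            "lr_initial": lr1,
            "lr_retrain": lr2,
            "ewc_lambda": 1.0 / lr2,  # only relevant for EWC runs
        }

grid = list(parameter_grid())
print(len(grid))  # 2 * 3 * 2 * 3 = 36 configurations
```

With 36 configurations per model and task, each experiment consists of 36 runs of 2 × 2500 iterations under these settings.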
2.5 Reproduction of Previous Results by Prescient Evaluation


In this experiment, we wish to determine whether it is possible to find a
parameterization for a given DNN model and task when there is perfect knowledge
about, and availability of, the initial and future sub-tasks. Applying the models
listed in Sect. 2.2 to the tasks described in Sect. 2.1, and using the experimental
procedure detailed in Sect. 2.3, we obtain the results summarized in Table 3
(applying the “prescient” evaluation of Algorithm 1). We can state the following
insights: first of all, we can reproduce the basic results from [3] using the fc
model on DP10-10, which avoids catastrophic forgetting (contrary to the
conclusions drawn in that paper, whose authors consider the very modest decrease in
performance to be catastrophic forgetting). This is, however, very specific to this
particular task, and in fact all models except EWC exhibit blatant catastrophic
forgetting behavior, particularly on the D5-5 type tasks, while performing
adequately if not perfectly on the D9-1 tasks. EWC performs well on these tasks
as well, so we can state that EWC is the only tested algorithm that avoids CF
for all tasks when using prescient evaluation. Another observation is that the use
of Dropout, as suggested in [3], does not seem to significantly improve matters.
The LWTA method performs a little better than fc, D-fc, conv and D-conv but
is surpassed by EWC by a very large margin.


2.6 Realistic Evaluation 


This experiment imposes the much more restrictive and realistic evaluation detailed
in Sect. 2.3 and Algorithm 2, essentially performing initial training and model
selection only on D1 and retraining only using D2. It is this or related schemes
that would have to be used in typical application scenarios, and it thus represents
the principal subject of this article. The performances of all tested DNN models
on all of the tasks from Sect. 2.1 are summarized in Table 4. Plots of experimental
results over time for the D-fc and EWC models are given in Figs. 2, 3, 4 and 5.
The results show a rather bleak picture where only the EWC model achieves
significant success for the D9-1 type tasks while failing for the D5-5 tasks. All
other models do not even achieve this partial success and exhibit strong CF for
all tasks. We can therefore observe that a different choice of evaluation procedure
strongly impacts results and the conclusions which are drawn concerning CF in
DNNs. For the realistic evaluation condition, which in our view is much more
relevant than the prescient one used in nearly all of the related work on the
subject, CF occurs for all DNN models we tested, and partly even for the EWC
model. As to the question of why EWC performs well for all of the D9-1 type tasks in
contrast to the D5-5 type tasks, one might speculate that the addition of five new
classes, as opposed to one, might exceed EWC’s capabilities of protecting the
weights most relevant to D1. Various different values of the constant λ governing
the contribution of Fisher information in EWC were tested but with very similar
results.
3 Discussion of Results and Principal Conclusions


From our experiments, we draw the following principal conclusions: 


— CF should be investigated using appropriate evaluation paradigms that
reflect application conditions. At the very least, using future data for model
selection is inappropriate. Avoiding it leads to conclusions that are radically
different from most related experimental work, see Sect. 1.3.

— Using a realistic evaluation paradigm, we find that CF is still very much a
problem for all investigated methods.

— In particular, Dropout is not effective against CF; neither is LWTA.

— The permuted MNIST task can be solved by almost any DNN model in almost
any topology, so all conclusions drawn from using this task should be
revisited.

— EWC seems to be partly effective but fails for all of the D5-5 tasks, indicating
that it is not the last word in this matter.


Data: model m, sub-tasks D1 & D2, parameter set P
Result: quality of best model q_{m_p*}
initialize q_{m_p*} ← −1
foreach parameters p ∈ P do
    initial training of m_p on D1 for t_max iterations
    for t ← 0 to t_max iterations do      // retraining of m_p on D2
        update m_p on D2 using batch_size
        q_{m_p,t} ← χ(D2, D1 ∪ D2, t)
        if q_{m_p,t} > q_{m_p*} then q_{m_p*} ← q_{m_p,t}
return q_{m_p*}
Algorithm 1. The prescient evaluation strategy.
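
The selection step of the prescient strategy can be sketched in Python once the accuracy curves of all runs have been recorded. The function below is an illustration under assumptions: it takes the recorded curves as a plain dictionary (a hypothetical data layout) instead of training any model, and returns the best accuracy on D1 ∪ D2 over all parameter settings and all retraining iterations.

```python
def prescient_quality(joint_curves):
    """Best accuracy on D1 ∪ D2 over all parameter settings and iterations.

    joint_curves maps each parameter setting p to the list of accuracies
    measured on D1 ∪ D2 at every retraining iteration t.
    """
    best = -1.0
    for curve in joint_curves.values():
        for quality in curve:
            if quality > best:
                best = quality
    return best

# Toy curves for two hypothetical parameter settings.
curves = {
    "p1": [0.50, 0.72, 0.64],
    "p2": [0.50, 0.69, 0.91],
}
best_quality = prescient_quality(curves)
```

Because the maximum is taken over curves that were measured on D1 ∪ D2, both sub-tasks must be available throughout, which is exactly the assumption the prescient strategy makes.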


We write that EWC “seems to be partly effective”, meaning it solves some
incremental tasks well while it fails for others. So we observe that there is no
guarantee that can be obtained from a purely empirical validation approach such
as ours; yet another type of incremental learning task might be solved perfectly
or not at all. This points to the principal conceptual problem that we see when
investigating CF in DNNs: there is no theory that might offer any guarantees.
Such guarantees could be very useful in practice, the most interesting one being
how to determine a lower bound on performance loss on D1 ∪ D2, without having
access to D1, only to the network state and D2. Other guarantees could provide
upper bounds on retraining time before performance on D1 ∪ D2 degrades.
Table 3. Results for prescient evaluation. Please note that the performance level of
complete catastrophic forgetting (i.e., chance-level classification after retraining with
D2) depends on the dataset considered: for the D5-5 dataset it is at 0.5, whereas it is
at 0.1 for the D9-1 datasets. The rightmost column indicates the DP10-10 task which
is solved near-perfectly by all models.


Model  | D5-5a | D5-5b | D5-5c | D5-5d | D5-5e | D5-5f | D5-5g | D5-5h | D9-1a | D9-1b | D9-1c | DP10-10
EWC    | 0.92  | 0.92  | 0.91  | 0.93  | 0.94  | 0.94  | 0.89  | 0.93  | 1.00  | 1.00  | 1.00  | 1.00
fc     | 0.69  | 0.63  | 0.58  | 0.65  | 0.61  | 0.58  | 0.61  | 0.69  | 0.87  | 0.87  | 0.86  | 0.97
D-fc   | 0.58  | 0.60  | 0.61  | 0.66  | 0.61  | 0.54  | 0.63  | 0.64  | 0.87  | 0.87  | 0.85  | 0.96
conv   | 0.51  | 0.50  | 0.50  | 0.50  | 0.50  | 0.50  | 0.51  | 0.49  | 0.89  | 0.89  | 0.87  | 0.95
D-conv | 0.51  | 0.50  | 0.50  | 0.50  | 0.50  | 0.50  | 0.50  | 0.49  | 0.81  | 0.84  | 0.87  | 0.96
LWTA   | 0.66  | 0.68  | 0.64  | 0.73  | 0.71  | 0.62  | 0.68  | 0.71  | 0.88  | 0.91  | 0.91  | 0.97


Data: model m, sub-tasks D1 & D2, parameter set P
Result: quality of best model q_{m_p*}
initialize q_{p*} ← −1
forall the parameters p ∈ P do      // determine best model parameters by training on D1
    for t ← 0 to t_max iterations do
        update of m_p on D1 using batch_size; q_{m_p,t} ← χ(D1, D1, t)
        if q_{m_p,t} > q_{p*} then q_{p*} ← q_{m_p,t}; m_p* ← m_p
initialize q_{m_p*} ← −1
forall the retraining learning rates ε ∈ ε_D2 do
    initialize q_R ← −1
    for t ← 0 to t_max iterations do      // retraining of m_p* on D2
        update m_p* on D2 with learning rate ε; q_{m_p*,t} ← χ(D2, D2, t)
        if q_{m_p*,t} > q_R then q_R ← q_{m_p*,t}
    t_e ← arg min_t (q_{m_p*,t} > 0.99 · q_R); q_{m_p*,ε} ← χ(D2, D1 ∪ D2, t_e)
    if q_{m_p*,ε} > q_{m_p*} then q_{m_p*} ← q_{m_p*,ε}
return q_{m_p*}
Algorithm 2. The realistic evaluation strategy.
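
The realistic strategy can be sketched in the same style. Again this is only an illustration operating on recorded accuracy curves with a hypothetical data layout: the model is selected from accuracy on D1 alone, the stopping iteration is determined from accuracy on D2 alone (the first iteration exceeding 0.99 times the best D2 accuracy), and the accuracy on D1 ∪ D2 is read off only at that stopping point.

```python
def realistic_quality(d1_curves, retrain_curves, threshold=0.99):
    """Select the model on D1 only, stop retraining using D2 only.

    d1_curves      -- {p: accuracies on D1 during initial training}
    retrain_curves -- {p: {lr: (accuracies on D2, accuracies on D1 ∪ D2)}}
    """
    # Step 1: best parameter setting, judged on D1 alone.
    best_p = max(d1_curves, key=lambda p: max(d1_curves[p]))

    # Step 2: for each retraining learning rate, stop at the first iteration
    # whose D2 accuracy exceeds threshold * (best D2 accuracy of that run).
    best_quality = -1.0
    for d2_curve, joint_curve in retrain_curves[best_p].values():
        q_r = max(d2_curve)
        t_e = next((t for t, q in enumerate(d2_curve) if q > threshold * q_r),
                   len(d2_curve) - 1)  # fallback if no iteration qualifies
        best_quality = max(best_quality, joint_curve[t_e])
    return best_p, best_quality

d1_curves = {"a": [0.5, 0.9], "b": [0.6, 0.95]}
retrain_curves = {
    "a": {0.001: ([0.3, 0.9], [0.9, 0.6])},
    "b": {
        0.001: ([0.2, 0.8, 0.995, 1.0], [0.9, 0.7, 0.5, 0.4]),
        0.0001: ([0.1, 0.3, 0.5, 0.6], [0.95, 0.9, 0.85, 0.8]),
    },
}
selected, quality = realistic_quality(d1_curves, retrain_curves)
```

In the toy data, setting "b" is selected because it reaches the higher accuracy on D1. The larger learning rate stops at iteration 2 and the smaller one at iteration 3, and the better of the two joint accuracies at those stopping points is reported.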


Table 4. Results for realistic evaluation. Please note that the performance level of
total catastrophic forgetting (i.e., chance-level classification after retraining with D2)
depends on the dataset: for the D5-5 dataset it is at 0.5, whereas it is at 0.1 for the
D9-1 datasets. The rightmost column indicates the DP10-10 task (“permuted MNIST”)
which is again solved near-perfectly by all models.


Model  | D5-5a | D5-5b | D5-5c | D5-5d | D5-5e | D5-5f | D5-5g | D5-5h | D9-1a | D9-1b | D9-1c | DP10-10
EWC    | 0.48  | 0.56  | 0.62  | 0.52  | 0.58  | 0.58  | 0.55  | 0.53  | 0.82  | 0.91  | 0.97  | 0.99
fc     | 0.47  | 0.49  | 0.50  | 0.50  | 0.48  | 0.49  | 0.50  | 0.49  | 0.15  | 0.10  | 0.23  | 0.97
D-fc   | 0.47  | 0.50  | 0.50  | 0.50  | 0.49  | 0.49  | 0.50  | 0.49  | 0.52  | 0.10  | 0.16  | 0.96
conv   | 0.48  | 0.50  | 0.50  | 0.50  | 0.49  | 0.50  | 0.51  | 0.49  | 0.29  | 0.33  | 0.11  | 0.95
D-conv | 0.48  | 0.50  | 0.50  | 0.50  | 0.45  | 0.50  | 0.50  | 0.49  | 0.24  | 0.22  | 0.14  | 0.96
LWTA   | 0.47  | 0.50  | 0.50  | 0.50  | 0.49  | 0.49  | 0.51  | 0.49  | 0.48  | 0.29  | 0.66  | 0.97
Fig. 2. Best EWC runs on D9-1 datasets in the realistic evaluation condition.
Fig. 4. Best D-fc runs on D9-1 datasets in the realistic evaluation condition.
Fig. 5. Best D-fc runs on D5-5 datasets in the realistic evaluation condition.
4 Future Work


The issue of CF is a complex one, and correspondingly our article and our
experimental procedures are complex as well. There are several points where we made
rather arbitrary choices, e.g., when choosing the constant u = 0.99 in the
realistic evaluation Algorithm 2. The results are affected by this choice, although we
verified that the trend is unchanged. Another weak point is our model selection
procedure: a much larger combinatorial set of model hyperparameters should be
sampled, including Dropout rates, convolution filter kernels, and the number and size of
layers. This might conceivably make it possible to identify model hyperparameters that avoid
CF for some or all tested models, although we consider this unlikely. Lastly, the
use of MNIST might be criticized as being too simple: this is correct, and we are
currently conducting experiments with more complex classification tasks (e.g., SVHN
and CIFAR-10). However, as our conclusion is that none of the currently
proposed DNN models can avoid CF, this is not very likely to change when using
an even more challenging classification task (rather the reverse, in fact).
Abstract. Online class imbalance learning constitutes a new problem
and an emerging research topic that focusses on the challenges of online
learning under class imbalance and concept drift. Class imbalance deals
with data streams that have very skewed distributions while concept drift
deals with changes in the class imbalance status. Little work exists that
addresses these challenges and in this paper we introduce queue-based
resampling, a novel algorithm that successfully addresses the co-existence
of class imbalance and concept drift. The central idea of the proposed
resampling algorithm is to selectively include in the training set a subset
of the examples that appeared in the past. Results on two popular
benchmark datasets demonstrate the effectiveness of queue-based resampling
over state-of-the-art methods in terms of learning speed and quality.


Keywords: Online learning · Class imbalance · Concept drift ·
Resampling · Neural networks · Data streams


1 Introduction 


In the monitoring and security of critical infrastructures, which include
large-scale, complex systems such as power and energy systems, water,
transportation and telecommunication networks, the system state typically remains
normal or healthy for a sustained period of time until an abnormal event occurs
[10]. Such abnormal events or faults can lead to serious
degradation in performance or, even worse, to cascading overall system failure
and breakdown. The consequences are tremendous and may have a huge impact
on everyday life and well-being. Examples include real-time prediction of
hazardous events in environment monitoring systems and intrusion detection in
computer networks. In critical infrastructure systems, the system is in a healthy
state the majority of the time and failures are low-probability events; therefore,
class imbalance is a major challenge encountered in this area.

Class imbalance occurs when at least one data class is under-represented
compared to others, thus constituting a minority class. It is a difficult problem
as the skewed distribution makes a traditional learning algorithm ineffective;
specifically, its prediction power is typically low for the minority class examples
and its generalisation ability is poor [16]. The problem becomes significantly
harder when class imbalance co-exists with concept drift. Only a
handful of studies on online class imbalance learning exist. Focussing on binary
classification problems, we introduce a novel algorithm, queue-based resampling,
whose central idea is to selectively include in the training set a subset of the negative
and positive examples by maintaining a separate queue for each class. Our study
examines two popular benchmark datasets under various class imbalance rates
with and without the presence of drift. Queue-based resampling outperforms
state-of-the-art methods in terms of learning speed and quality.


2 Background and Related Work 


2.1 Online Learning 


In online learning [1], a data generating process provides at each time step t a
sequence of examples $(x^t, y^t)$ from an unknown probability distribution $p^t(x, y)$,
where $x^t \in \mathbb{R}^d$ is a d-dimensional input vector belonging to input space X and
$y^t \in Y$ is the class label, where $Y = \{c_1, \ldots, c_N\}$ and N is the number of classes.
An online classifier is built that receives a new example $x^t$ at time step t and
makes a prediction $\hat{y}^t$. Specifically, assume a concept $h : X \to Y$ such that
$\hat{y}^t = h(x^t)$. After some time the classifier receives the true label $y^t$; its
performance is evaluated using a loss function $J(y^t, \hat{y}^t)$, and it is then trained,
i.e. its parameters are updated based on the loss J incurred. The example is discarded
to enable learning in high-speed data streaming applications. This process is
repeated at each time step. Depending on the application, new examples do not
necessarily arrive at regular and pre-defined intervals.
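
The predict, evaluate, train and discard cycle described above can be sketched as a simple loop. The classifier used here, an online logistic regression trained by one gradient step per example, is an assumption chosen only to keep the example short; the names `OnlineLogistic` and `run_stream` are hypothetical and the data stream is synthetic.

```python
import numpy as np

class OnlineLogistic:
    """Binary classifier updated with one gradient step per example."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        return 1.0 / (1.0 + np.exp(-(self.w @ x + self.b)))

    def update(self, x, y):
        err = self.predict_proba(x) - y  # gradient of the log-loss w.r.t. the logit
        self.w -= self.lr * err * x
        self.b -= self.lr * err

def run_stream(stream, model):
    """Predict first, then evaluate with the true label, then train and discard."""
    correct, n = 0, 0
    for x, y in stream:
        y_hat = int(model.predict_proba(x) >= 0.5)  # prediction for the new example
        correct += int(y_hat == y)                  # evaluated once the label arrives
        model.update(x, y)                          # one update, example not stored
        n += 1
    return correct / n

# Synthetic stream: the label is 1 exactly when the single input feature is positive.
rng = np.random.default_rng(0)
values = rng.normal(size=500)
stream = [(np.array([v]), int(v > 0)) for v in values]
model = OnlineLogistic(dim=1)
accuracy = run_stream(stream, model)
```

Since every prediction is made before the corresponding label is used for training, the returned accuracy is measured on examples the model has not yet seen.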

We distinguish chunk-based learning [1] from online learning, where at
each time step t we receive a chunk of M > 1 examples $C^t = \{(x_i^t, y_i^t)\}_{i=1}^{M}$.
Both approaches build a model incrementally; however, the design of chunk-based
algorithms differs significantly and, therefore, the majority is typically
not suitable for online learning tasks [16]. This work focuses on online learning.


2.2 Class Imbalance and Concept Drift 


Class imbalance [6] constitutes a major challenge in learning and occurs when
at least one data class is under-represented compared to others, thus constituting
a minority class. Considering, for example, a binary classification problem,
class 1 (positive) and class 0 (negative) constitute the minority and majority class
respectively if $p(y = 1) < p(y = 0)$. Class imbalance has been extensively studied
in offline learning, and techniques addressing the problem are typically split
into two categories: data-level and algorithm-level techniques.
Data-level techniques consist of resampling techniques that alter the training
set to deal with the skewed data distribution; specifically, oversampling
techniques “grow” the minority class while undersampling techniques “shrink”
the majority class. The simplest and most popular resampling techniques are
random oversampling and random undersampling, where data examples are randomly
added or removed respectively [16,17]. More sophisticated resampling techniques
exist; for example, the use of Tomek links discards borderline examples,
while the SMOTE algorithm generates new minority class examples based on
their similarities to the original ones. Interestingly, sophisticated techniques do
not always outperform the simpler ones [16]. Furthermore, since their mechanism
relies on identifying relations between training data, they are difficult to
apply in online learning tasks, although some initial effort has recently been
made [13].

Algorithm-level techniques modify the classification algorithm directly to 
deal with the imbalance problem. Cost-sensitive learning is widely adopted and 
assigns a different cost to each data class [17]. Alternatives are threshold-moving 
[17] methods where the classifier’s threshold is modified such that it becomes 
harder to misclassify minority class examples. Contrary to resampling methods 
that are algorithm-agnostic, algorithm-level methods are not as widely used [16]. 

A challenge in online learning is that of concept drift [1], where the data
generating process evolves over time. Formally, a drift corresponds to a change
in the joint probability $p(x, y)$. Although drift can manifest itself in other
forms, this work focuses on $p(y)$ drift (i.e. a change in the prior probability)
because such a change can lead to class imbalance. Note that the true decision
boundary remains unaffected when $p(y)$ drift occurs; however, the classifier's
learnt boundary may drift away from the true one.


2.3 Online Class Imbalance Learning 


The majority of existing work addresses class imbalance in offline learning, while 
some others require chunk-based data processing [8,16]. Little work deals with 
class imbalance in online learning and this section discusses the state-of-the-art. 

The authors in [14] propose the cost-sensitive online gradient descent
(CSOGD) method that uses the following loss function:


$$J = \left( I_{y=0} + I_{y=1} \frac{w_p}{w_n} \right) l(y, \hat{y}) \qquad (1)$$

where $I_{condition}$ is the indicator function that returns 1 if condition is satisfied
and 0 otherwise, and $0 < w_p, w_n < 1$ with $w_p + w_n = 1$ are the costs for the
positive and negative classes respectively. The authors use the perceptron classifier
and stochastic gradient descent, and apply the cost-sensitive modification to the
hinge loss function, achieving excellent results. The downside of this method is
that the costs need to be pre-defined; however, the extent of the class imbalance
problem may not be known in advance. In addition, it cannot cope with concept
drift as the pre-defined costs remain static. In [5], the authors introduce
RLSACP, which is a cost-sensitive perceptron-based classifier with an adaptive
cost strategy.

[Figure: examples $z^t = (x^t, y^t)$ arriving at time steps $t = 0, \ldots, 6$, with the contents of the queues $q_n^t$ and $q_p^t$ at each step]

Fig. 1. Example of Queue$_2$ resampling (Color figure online)


A time decayed class size metric is defined in [15] where for each class $c_k$, its
size $s_k$ is updated at each time step $t$ according to the following equation:


$$s_k^t = \theta s_k^{t-1} + I_{y^t = c_k} (1 - \theta) \qquad (2)$$


where $0 < \theta < 1$ is a pre-defined time decay factor that gives less emphasis on
older data. This metric is used to determine the imbalance rate at any given
time. For instance, for a binary classification problem where the positive class
constitutes the minority, the imbalance rate at any given time $t$ is given by $s_p^t / s_n^t$.
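
The update in Eq. 2 can be illustrated with a few lines of Python. This is only a sketch under stated assumptions: the function name `update_class_sizes`, the zero initial sizes and the synthetic label stream are illustrative choices and are not taken from [15].

```python
def update_class_sizes(sizes, label, theta=0.99):
    """Apply one step of the time-decayed class size update (Eq. 2)."""
    return {
        c: theta * s + (1.0 - theta) * (1.0 if c == label else 0.0)
        for c, s in sizes.items()
    }


# Assumed starting point: both class sizes are zero before any example arrives.
sizes = {0: 0.0, 1: 0.0}

# Synthetic stream in which every tenth example is positive (label 1).
stream = [1 if t % 10 == 0 else 0 for t in range(2000)]
for y in stream:
    sizes = update_class_sizes(sizes, y)

# Ratio of the positive to the negative class size at the end of the stream.
imbalance_rate = sizes[1] / sizes[0]
```

Because every step adds a total mass of $1 - \theta$ and decays the rest by $\theta$, the sizes sum to approximately one after enough steps, so they can be read as recency-weighted class proportions.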

Oversampling-based online bagging (OOB) is an ensemble method that
adjusts the learning bias from the majority to the minority class adaptively
through resampling by utilising the time decayed class size metric [15]. An
undersampling version called UOB has also been proposed but was demonstrated to be
unstable. OOB with 50 neural networks has been shown to have superior
performance. To determine the effectiveness of resampling alone, the authors examine
the special case where there exists only a single classifier, denoted by OOB$_{sg}$.
Compared against the aforementioned RLSACP and others, OOB$_{sg}$ has been
shown to outperform the rest in the majority of the cases, thus concluding that
resampling is the main reason behind the effectiveness of the ensemble [15].

Another approach to address drift is the use of sliding windows [8]. It can be 
viewed as adding a memory component to the online learner; given a window of 
size W, it keeps in the memory the most recent W examples. Despite being able 
to address concept drift, it is difficult to determine a priori the window size as a 
larger window is better suited for a slow drift, while a smaller window is suitable 
for a rapid drift. More sophisticated algorithms have been proposed, such as a
window of adaptable size or the use of multiple windows of different sizes [11].
The drawback of this approach is that it cannot handle class imbalance. 


3 Queue-Based Resampling 


Online class imbalance learning is an emerging research topic and this work
proposes queue-based resampling, a novel algorithm that addresses this problem.
Focussing on binary classification, the central idea of the proposed resampling
algorithm is to selectively include in the training set a subset of the positive
and negative examples that have appeared so far. The work closest to ours is [4],
where the authors apply an analogous idea but in the context of chunk-based learning.

Algorithm 1. Queue-based Resampling
 1: Input:
      maximum length $L$ of each queue
      queues $(q_p, q_n)$ for positive and negative examples
 2: for each time step $t$ do
 3:     receive example $x^t \in \mathbb{R}^d$
 4:     predict class $\hat{y}^t \in \{0, 1\}$
 5:     receive true label $y^t \in \{0, 1\}$
 6:     let $z^t = (x^t, y^t)$
 7:     if $y^t == 0$ then
 8:         $q_n^t = q_n^{t-1}$.append($z^t$)
 9:     else
10:         $q_p^t = q_p^{t-1}$.append($z^t$)
11:     end if
12:     let $q^t = q_p^t \cup q_n^t$ be the training set
13:     calculate cost on $q^t$ using Eq. 3
14:     update classifier
15: end for


The selection of the examples is achieved by maintaining at any given time
$t$ two separate queues of equal length $L \in \mathbb{Z}^+$, $q_n^t = \{(x_i, y_i)\}_{i=1}^{L}$ and
$q_p^t = \{(x_i, y_i)\}_{i=1}^{L}$, that contain the negative and positive examples respectively.
Let $z_i = (x_i, y_i)$; for any two $z_i, z_j \in q_n^t$ (or $q_p^t$) such that $j > i$, $z_j$ arrived more
recently in time. Queue-based resampling stores the most recent example plus
$2L - 1$ old ones. We will refer to the proposed algorithm as Queue$_L$. Of particular
interest is the special case Queue$_1$ where the length of each queue is $L = 1$, as
it has the major advantage of requiring just a single data point from the past.

An example demonstrating how Queue$_L$ works when $L = 2$ is shown in Fig. 1.
The upper part shows the examples that arrive at each time step, e.g. $z^0$ and
$z^6$ arrive at $t = 0$ and $t = 6$ respectively. Positive examples are shown in green.
The bottom part shows the contents of each queue at each time step. Focussing
on $t = 5$, we can see that the queue $q_n^5$ contains the two most recent negative
examples, i.e. $z^4$ and $z^5$, and the queue $q_p^5$ contains the most recent positive
example, i.e. $z^1$, which is carried over since $t = 1$.

The union of the two queues is then taken, $q^t = q_p^t \cup q_n^t = \{(x_i, y_i)\}_{i=1}^{2L}$, to
form the new training set for the classifier. The cost function is given in Eq. 3:


$$J = \frac{1}{|q^t|} \sum_{i=1}^{|q^t|} l(y_i, h(x_i)) \qquad (3)$$


where $|q^t| \le 2L$ and $(x_i, y_i) \in q^t$. At each time step the classifier is updated once
according to the cost $J$ incurred, i.e. a single update of the classifier's weights is
performed. The pseudocode of our algorithm is shown in Algorithm 1.
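
The queue maintenance and the averaged cost can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the names `QueueResampler` and `mean_cost` are hypothetical, the classifier update is left out, and the label sequence is an assumption chosen so that only the example at $t = 1$ is positive, loosely following the narrative around Fig. 1.

```python
from collections import deque


class QueueResampler:
    """Keeps the L most recent examples of each class (binary labels 0 and 1)."""

    def __init__(self, max_len):
        self.queues = {0: deque(maxlen=max_len), 1: deque(maxlen=max_len)}

    def add(self, x, y):
        """Store the new example and return the current training set."""
        self.queues[y].append((x, y))  # a full deque drops its oldest item
        return list(self.queues[1]) + list(self.queues[0])


def mean_cost(training_set, predict, loss):
    """Average loss of `predict` over the training set, in the spirit of Eq. 3."""
    return sum(loss(y, predict(x)) for x, y in training_set) / len(training_set)


resampler = QueueResampler(max_len=2)
labels = [0, 1, 0, 0, 0, 0]  # assumed labels for t = 0, ..., 5
for t, y in enumerate(labels):
    # The time index stands in for the feature vector to keep the example short.
    training_set = resampler.add(x=t, y=y)

# Cost of a hypothetical classifier that always predicts the negative class.
cost = mean_cost(training_set, predict=lambda x: 0, loss=lambda y, p: float(y != p))
```

At $t = 5$ the training set holds the positive example from $t = 1$ together with the negatives from $t = 4$ and $t = 5$, so the always-negative classifier incurs a cost of one third even though only one of the six examples seen so far was positive.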

The effectiveness of queue-based resampling is attributed to a few important
characteristics. Maintaining separate queues for each class helps to address the
class imbalance problem. Including positive examples from the past in the most
recent training set can be viewed as a form of oversampling. The fact that examples
are propagated and carried over a series of time steps allows the classifier to
‘remember’ old concepts. Additionally, to address the challenge of concept drift,
the classifier needs to also be able to ‘forget’ old concepts. This is achieved by
bounding the length of the queues to $L$, so the queues essentially behave
like sliding windows as well. Consequently, the proposed queue-based resampling
method can cope with both class imbalance and concept drift.

Table 1. Compared methods

Method         | Class imbalance | Concept drift | Access to old data
Baseline       | No              | No            | No
Cost sensitive | Yes             | No            | No
Sliding window | No              | Yes           | Yes (W - 1)
OOB$_{sg}$     | Yes             | Yes           | No
Queue$_1$      | Yes             | Yes           | Yes (1)
Queue$_L$      | Yes             | Yes           | Yes (2L - 1)


4 Experimental Setup 


Our experimental study is based on two popular synthetic datasets from the
literature [2] where in both cases a classifier attempts to learn a non-linear
decision boundary. These are the Sine and Circle datasets, described below.


Sine. It consists of two attributes $x$ and $y$ uniformly distributed in $[0, 2\pi]$ and
$[-1, 1]$ respectively. The classification function is $y = \sin(x)$. Instances below the
curve are classified as positive and above the curve as negative. Feature rescaling
has been performed so that $x$ and $y$ are in $[0, 1]$.


Circle. It has two attributes $x$ and $y$ that are uniformly distributed in $[0, 1]$. The
circle function is given by $(x - x_c)^2 + (y - y_c)^2 = r_c^2$ where $(x_c, y_c)$ is its centre
and $r_c$ its radius. The circle with $(x_c, y_c) = (0.4, 0.5)$ and $r_c = 0.2$ is created.
Instances inside the circle are classified as positive and outside as negative.
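
The two dataset definitions translate directly into sampling code. The sketch below uses only the Python standard library; the function names `make_sine` and `make_circle`, the seed and the sample size are illustrative assumptions.

```python
import math
import random


def make_sine(n, rng):
    """Return n (features, label) pairs; label 1 when the point lies below y = sin(x)."""
    data = []
    for _ in range(n):
        x = rng.uniform(0.0, 2.0 * math.pi)
        y = rng.uniform(-1.0, 1.0)
        label = 1 if y < math.sin(x) else 0
        # Rescale both attributes to [0, 1].
        data.append(((x / (2.0 * math.pi), (y + 1.0) / 2.0), label))
    return data


def make_circle(n, rng, centre=(0.4, 0.5), radius=0.2):
    """Return n (features, label) pairs; label 1 when the point lies inside the circle."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        inside = (x - centre[0]) ** 2 + (y - centre[1]) ** 2 < radius ** 2
        data.append(((x, y), 1 if inside else 0))
    return data


rng = random.Random(0)
sine = make_sine(10_000, rng)
circle = make_circle(10_000, rng)
sine_positive_rate = sum(label for _, label in sine) / len(sine)
circle_positive_rate = sum(label for _, label in circle) / len(circle)
```

Sampled this way, about half of the Sine instances and about $\pi r_c^2 \approx 12.6\%$ of the Circle instances are positive. Reaching a prescribed prior such as $p(y = 1) = 0.01$ would need an additional class-wise sampling step, which is not shown here.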


Our baseline classifier is a neural network consisting of one hidden layer 
with eight neurons. Its configuration is as follows: He [7] weight initialisation, 
backpropagation and the ADAM [9] optimisation algorithms, learning rate of 
0.01, LeakyReLU [12] as the activation function of the hidden neurons, sigmoid 
activation for the output neuron, and the binary cross-entropy loss function. 

For our study we implemented a series of state-of-the-art methods as
described in Sect. 2.3. We implemented a cost sensitive version of the baseline
which we will refer to as CS; the cost of the positive class is set to
$w_p / w_n = 0.95 / 0.05 = 19$ as in [14]. Furthermore, the sliding window method has
been implemented with a window size of $W$. Moreover, OOB$_{sg}$ has been implemented
with the time decay factor set to $\theta = 0.99$ for calculating the class size at any
given time.

[Figure: G-mean against time step (0 to 5000) for $L = 1, 10, 25, 50$; panels (a) $p(y = 1) = 0.5$ and (b) $p(y = 1) = 0.01$]

Fig. 2. Effect of queue length on the Sine dataset

For the proposed resampling method we will use the special case Queue$_1$ and
another case Queue$_L$ where $L > 1$. Section 5.1 performs an analysis of Queue$_L$
by examining how the queue length $L$ affects the behaviour and performance of
queue-based resampling. For a fair comparison with the sliding window method,
we will set the window size to $W = 2L$, i.e. both methods will have access to
the same amount of old data examples. A summary of the compared methods
is shown in Table 1, indicating which methods are suitable for addressing class
imbalance and concept drift. It also indicates whether methods require access to
old data and, if yes, it includes the maximum number in the brackets.

A popular and suitable metric for evaluating algorithms under class imbalance
is the geometric mean as it is not sensitive to the class distribution [16].
It is defined as the geometric mean of recall and specificity. Recall is defined as
the true positive rate ($R = \frac{TP}{P}$) and specificity is defined as the true negative
rate ($S = \frac{TN}{N}$), where $TP$ and $P$ are the numbers of true positives and positives
respectively, and similarly, $TN$ and $N$ for the true negatives and negatives. The
geometric mean is then calculated using G-mean $= \sqrt{R \times S}$. To calculate the
recall and specificity online, we use the prequential evaluation using fading
factors as proposed in [3] and set the fading factor to $\alpha = 0.99$. In all graphs we
plot the prequential G-mean in every time step averaged over 30 runs, including
the error bars showing the standard error around the mean.
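
One way to compute such a prequential G-mean is sketched below. It assumes the usual fading-factor recurrences, a faded sum $S_i = L_i + \alpha S_{i-1}$ divided by a faded count $N_i = 1 + \alpha N_{i-1}$, applied separately to each true class; the class name `PrequentialGMean` and the choice to report 0 until both classes have been seen are assumptions of this sketch, not details given in the text.

```python
import math


class PrequentialGMean:
    """Tracks recall and specificity with fading factors and reports their geometric mean."""

    def __init__(self, alpha=0.99):
        self.alpha = alpha
        # Faded number of correct predictions and faded number of examples, per true class.
        self.correct = {0: 0.0, 1: 0.0}
        self.seen = {0: 0.0, 1: 0.0}

    def update(self, y_true, y_pred):
        hit = 1.0 if y_true == y_pred else 0.0
        self.correct[y_true] = hit + self.alpha * self.correct[y_true]
        self.seen[y_true] = 1.0 + self.alpha * self.seen[y_true]
        return self.value()

    def value(self):
        if self.seen[0] == 0.0 or self.seen[1] == 0.0:
            return 0.0  # assumed convention while one class is still unobserved
        recall = self.correct[1] / self.seen[1]
        specificity = self.correct[0] / self.seen[0]
        return math.sqrt(recall * specificity)


stream = [0, 0, 0, 1, 0, 0]

# A classifier that always predicts the majority class.
metric = PrequentialGMean(alpha=0.99)
for y in stream:
    always_negative = metric.update(y_true=y, y_pred=0)

# A classifier that is always right on the same stream.
metric = PrequentialGMean(alpha=0.99)
for y in stream:
    perfect = metric.update(y_true=y, y_pred=y)
```

The always-negative classifier is correct on five of the six examples, yet its G-mean is zero because its recall is zero, which is why the metric suits imbalanced streams.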


5 Experimental Results 


5.1 Analysis of Queue-Based Resampling 


In this section we investigate the behaviour of Queue$_L$ resampling under various
queue lengths ($L \in \{1, 10, 25, 50\}$) and examine how these affect its performance.
Furthermore, we consider a balanced scenario (i.e. $p(y = 1) = 0.5$) and a scenario
with a severe class imbalance of 1% (i.e. $p(y = 1) = 0.01$).

Fig. 3. Class imbalance on the Circle dataset


Figures 2a and b depict the behaviour of the proposed method on the balanced
and severely imbalanced scenario respectively for the Sine dataset. It can
be observed from Fig. 2a that the larger the queue length the better the
performance; specifically, the best performance is achieved when $L = 50$. It can be
observed from Fig. 2b that the smaller the queue length the faster the learning
speed. Queue$_1$ dominates in the first 500 time steps; however, its end
performance is inferior to the rest. The method with $L = 10$ dominates for over 3000
steps. Given additional learning time the method with $L = 25$ achieves the best
performance. The method with $L = 50$ is unable to outperform the one with
$L = 10$ after 5000 steps; in fact, it performs similarly to Queue$_1$.

It is important to emphasise that contrary to offline learning where the end
performance is of particular concern, in online learning both the end performance
and learning time are of high importance. For this reason, we have decided to
focus on Queue$_{25}$ as it constitutes a reasonable trade-off between learning speed
and performance. As already mentioned, we will also focus on Queue$_1$ as it has
the advantage of requiring only one data example from the past.


5.2 Comparative Study 


Figure 3a depicts a comparative study of all the methods in the scenario involving
10% class imbalance for the Circle dataset. The baseline method, as expected,
does not perform well and only starts learning after about 3000 time steps.
The proposed Queue$_{25}$ has the best performance at the expense of a late start.
Queue$_1$ also outperforms the rest although towards the end other methods like
OOB$_{sg}$ close the gap. Similar results are obtained for the Sine dataset but are
not presented here due to space constraints.

Figure 3b shows how the methods compare with each other in the 1% class
imbalance scenario. Both proposed methods outperform the state-of-the-art
OOB$_{sg}$. Despite the fact that Queue$_{25}$ performs considerably better than
Queue$_1$, it requires about 1500 time steps to surpass it. Additionally, we stress
that Queue$_1$ only requires access to a single old example.

Fig. 4. Class imbalance and concept drift


We now examine the behaviour of all methods in the presence of both class
imbalance and drift. Figures 4a and b show the performance of all methods for the
Sine and Circle datasets respectively. Initially, class imbalance is $p(y = 1) = 0.1$
but at time step $t = 2500$ an abrupt drift occurs and this becomes $p(y = 1) = 0.9$.
At the time of drift we reset the prequential G-mean to zero, thus ensuring the
performance observed remains unaffected by the performance prior to the drift [15].
Similar results are observed for both datasets. Queue$_{25}$ outperforms the rest at
the expense of a late start. Queue$_1$ starts learning fast; initially it outperforms
other methods but their end performance is close. OOB$_{sg}$ is affected more by
the drift in the Sine dataset but recovers soon. The baseline method outperforms
its cost sensitive version after the drift because the pre-defined costs of method
CS are no longer suitable in the new situation.


6 Conclusion 


Online class imbalance learning constitutes a new problem and an emerging
research topic. We propose a novel algorithm, queue-based resampling, to address
this problem. Focussing on binary classification problems, the central idea behind
queue-based resampling is to selectively include in the training set a subset of 
the negative and positive examples by maintaining at any given time a separate 
queue for each class. It has been shown to outperform state-of-the-art methods, 
particularly, in scenarios with severe class imbalance. It has also been demon- 
strated to work well when abrupt concept drift occurs. Future work will examine 
the behaviour of queue-based resampling in various other types of concept drift 
(e.g. gradual). A challenge faced in the area of monitoring of critical infrastruc- 
tures is that the true label of examples can be noisy or even not available. We 
plan to address this challenge in the future.

Abstract. We present a novel adaptive feedforward neural network for
online learning from doubly-streaming data, where both the data volume
and feature space grow simultaneously. Traditional online learning
and feature selection algorithms cannot handle this problem because they
assume that the feature space of the data stream remains unchanged.
We propose a Single Hidden Layer Feedforward Neural Network with
Shortcut Connections (SLFN-S) that learns whether a data stream needs to
be mapped using a non-linear transformation or not, to speed up the
learning convergence. We employ a growing strategy to adjust the model
complexity to the continuously changing feature space. Finally, we use a
weight-based pruning procedure to keep the run time complexity of the
proposed model linear in the size of the input feature space, for efficient
learning from data streams. Experiments with trapezoidal data streams
on 8 UCI datasets were conducted to examine the performance of the
proposed model. We show that SLFN-S outperforms the state-of-the-art
algorithm for learning from trapezoidal data streams [16].


Keywords: Online learning · Trapezoidal data streams ·
Feedforward Neural Networks · Shortcut connections


1 Introduction 


Online learning makes it possible to learn in applications where the complete
data is initially not available or the data is too large to fit into memory. Online
learning algorithms can learn from continuously growing data, where new patterns
are introduced over time. A wide range of online learning algorithms are
available, and can be grouped into first-order and second-order methods.
First-order methods such as [1] use first-order derivatives to minimize a loss
function. Second-order methods such as [2] exploit the second-order information to
improve the convergence. However, second-order methods are more prone to be
stuck at local minima and tend to be computationally costly while working with
high-dimensional data. Traditional online learning algorithms assume that the
feature space of the input data remains constant, and try to fit a model of
constant complexity to it. However, in many applications, feature spaces can grow
over time. New features can be introduced, and combinations of new and existing
features can form meaningful higher order concepts. For example, in social
networks, sets of attributes provided by each user can grow over time. In an infinite
vocabulary topic model [15], the number of documents and the text vocabulary
can simultaneously increase over time. In high resolution video streams, both
the number of frames and the feature space formed by the extracted features
can grow over time.

Data streams where the numbers of instances and features grow simultane- 
ously are referred to as trapezoidal data streams [16]. Learning from trapezoidal 
data streams is more challenging than other online learning problems, because 
of their doubly-streaming nature. While learning from trapezoidal data streams, 
the model should be able to: 


- Learn from sequentially presented instances on a single pass,
- Have low running time and memory complexity,
- Adapt to increasing complexity of the feature space, and
- Do feature selection to bound the number of features used in the model.


Zhang et al. proposed OL$_{SF}$ algorithms [16] to learn a classifier from
trapezoidal data streams by using a passive-aggressive update rule for the existing
features and a structural risk minimization principle for the newly introduced
features. Also, OL$_{SF}$ algorithms contain projection and truncation steps to
promote sparsity and do feature selection. They do not however consider the
increasing complexity of the feature space, which can be caused by various feature
interactions. Additionally, OL$_{SF}$ algorithms are limited to learning linear decision
boundaries, which are likely to perform poorly on nonlinearly separable data.
Finally, unless an additional method such as One vs. One or One vs. Rest is
employed, OL$_{SF}$ algorithms can only work under binary classification settings.

Fully connected neural networks consider complex unknown nonlinear
mappings of the input features and form decision regions of arbitrary shapes to make
predictions [4]. A feature's value at any layer affects the values of all features at
the next layer. Therefore, feature interactions are naturally considered. There
is also a wide range of studies for online learning with neural networks [9].
Constructive methods such as Resource Allocating Networks (RAN) [8] can adapt
the network architecture based on the novelty of the received data in a sequential
manner, by adding hidden layer nodes to approximate the complexity of the
underlying function. Minimal Resource Allocating Networks (M-RAN) [14]
combine the growth criterion of RAN with a pruning strategy. Their pruning
strategy removes hidden units that consistently make little contribution, to learn
a more compact network compared to RAN. Single Hidden Layer Feedforward
Neural Networks (SLFN) [5] can form decision boundaries in arbitrary shapes
with any bounded continuous nonconstant activation function or any arbitrary
bounded activation function with unequal limits at infinities.

Encouraged by the capabilities of the neural networks mentioned above, we
propose SLFN-S, an Adaptive Single Hidden Layer Feedforward Neural Network
with Shortcut Connections. Our proposed model provides growing and pruning
capabilities to learn from trapezoidal data streams. The model learns whether a linear
mapping is enough to correctly classify the current instance, to speed up the
learning convergence. Unlike the existing neural network based models, SLFN-S
is able to learn simplified linear and nonlinear decision boundaries to converge
on a single pass, and adapt itself to the increasing complexity of the trapezoidal
data streams.


2 Methodology 


We consider the classification problem on trapezoidal data streams where $(x_t, y_t)$
is the pair of input training data and class label received at time $t$. $x_t \in \mathbb{R}^{d_t}$ is a
$d_t$ dimensional vector where $d_t \le d_{t+1}$. Let the numbers of input and hidden layer
nodes of the network at time $t$ be $d_t^i$ and $d_t^h$ respectively. Note that $d_t^i$, $d_t^h$ and
$d^o$ represent the sizes of the feature spaces that the network operates in for the
input, hidden and output layers. When the network receives an input $(x_t, y_t)$,
if $d_t = d_t^i$, the feedforward pass calculates a mixture of linear and nonlinear
mappings of the input and makes a prediction $\hat{y}_t$. If $d_t > d_t^i$, then $d_t - d_t^i$ new
nodes for the input and hidden layers are allocated and $(d_t - d_t^i) \cdot (d_t^i + d_t^h)$ fully
connected weights are initialized. Note that after this operation $d_t^i$ is equal to
$d_t$, therefore the network can make a prediction and update its weights. Finally,
the network is pruned to bound the number of connections. Steps for training
the proposed model are given in Algorithm 1.


2.1 Network Architecture 


We propose a single hidden layer feedforward neural network with shortcut
connections shown in Fig. 1. $x_t^1$ to $x_t^{d_t}$ represent the set of features provided by the
instance $x_t$. The input layer is connected to the hidden layer and the mixing
layer with weights $W_{in}$ and $W_{identity}$ respectively. In an unpruned network, the
input and the hidden layers are fully connected. The hidden layer uses ReLU
activation and is also connected to the mixing layer with $W_{identity}$. The mixing
layer receives the output values from the input and hidden layers, then passes
the element-wise summation of its two inputs to the next layer. The mixing layer
is fully connected to the output layer with weights $W_o$, which acts as a linear
classifier. The output layer uses Softmax activation. The prediction $\hat{y}_t$ of the
network for the instance $x_t$ can be expressed as:


$$\hat{y}_t = \sigma(W_o(g(W_{in} x_t) + x_t)) \qquad (1)$$


Note that if the ideal mapping of the input $x_t$ is $H(x_t)$, the network tries to
approximate this mapping by learning the residual $H(x_t) - x_t = g(W_{in} x_t)$ in
the hidden layer.

As the loss function, we use the Kullback-Leibler divergence between the
prediction $\hat{y}_t$ and the true label $y_t$. We regularize the loss function using the
norm of the network weights. The loss for the prediction-true label pair $(\hat{y}_t, y_t)$
is calculated by:


$$L_t = -\sum_{d \in d^o} y_t^d \log \frac{\hat{y}_t^d}{y_t^d} + \lambda \left( \|W_{in}\|^2 + \|W_o\|^2 \right) \qquad (2)$$


Learning Decision Boundaries from Trapezoidal Streams 511 


Algorithm 1. Network training.

Input:
    $\epsilon$: Learning rate
    $\phi$: Pruning strength parameter
    $\lambda$: Regularization parameter
 1  for $t = 1, \ldots, T$ do
 2      Receive instance $x_t \in \mathbb{R}^{d_t}$.
 3      Receive label $y_t$.
 4      if $t == 1$ then
 5          Initialize network with $d_t$ input and hidden layer nodes.
 6      end
 7      Let $d_t^i$ be the size of the input feature space of the network.
 8      if $d_t^i < d_t$ then
 9          Allocate $d_t - d_t^i$ input layer nodes.
10          Allocate $d_t - d_t^i$ hidden layer nodes.
11          Randomly initialize new $(d_t - d_t^i) \cdot (d_t^i + d_t^h)$ weights to fully connect
            new and existing nodes.
12      end
13      Predict the class label $\hat{y}_t = \sigma(W_o(g(W_{in} x_t) + x_t))$.
14      Calculate loss $L_t = -\sum_{d \in d^o} y_t^d \log \frac{\hat{y}_t^d}{y_t^d} + \lambda(\|W_{in}\|^2 + \|W_o\|^2)$.
15      Do a single epoch back propagation and weight update using $L_t$ and $\epsilon$.
16      Prune the network to keep the largest $\phi d_t$ weights.
17  end


where $d^o$ is the number of output neurons, $\lambda$ is the parameter that
controls the regularization strength, and $y_t^d$ is the $d$th element of the vector $y_t$.


2.2 Shortcut Connections 


Shortcut connections have been used in neural networks for various reasons. [10] 
uses shortcut connections to model linear dependencies and separate the learning 
of linear and non-linear parts of the mapping. [13] uses shortcut connections 
to decompose the network into biased and centered subnets, and train them 
simultaneously. [11] uses shortcut connections to center the input, hidden layer 
activations and error signals to improve the learning speed. [12] addresses vanishing
and exploding gradients with shortcut connections. Finally, [3] uses shortcut
connections to ensure that deeper layers do not make worse mappings than their 
shallower counterparts. 

[Figure: network diagram showing the input layer and the hidden layer]

Fig. 1. The network architecture.

While learning from data streams, because there is no bound to the number
of instances received, the learner can only do a single pass over each instance.
Therefore, the model complexity should be low enough to converge in a single
pass, and high enough to extract useful patterns of various complexity from the
data. For a classification problem, these constraints can be associated with the
complexity of the decision boundary. We use shortcut connections to condition
the network to learn linear decision boundaries, unless a nonlinear mapping is
necessary. At each feedforward pass, the mixing layer outputs the summation
of the input $x_t$ and the nonlinear mapping $g(W_{in} x_t)$. Then, the output layer uses
this summation to make a linear classification. If a linear decision boundary is
enough for the instance $x_t$, the output of the hidden layer $g(W_{in} x_t)$ will be equal
to zero. As a result, the input itself will be passed to the linear output layer for
prediction. Otherwise, a mixture of the input and its nonlinear mapping will be used.
This mechanism helps the network to use simpler decision boundaries by forcing
the hidden layer to learn a residual mapping on top of the input features, instead
of learning a completely new mapping.
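
The effect of the shortcut can be seen in a small forward-pass sketch written with plain Python lists. The helper names (`relu`, `matvec`, `softmax`, `forward`) and all weight values are illustrative assumptions; the shortcut is treated as an identity mapping, and training, growing and pruning are omitted.

```python
import math


def relu(v):
    return [max(0.0, a) for a in v]


def matvec(w, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def softmax(v):
    m = max(v)
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


def forward(w_in, w_o, x):
    """Compute softmax(W_o (relu(W_in x) + x)), following the form of Eq. 1."""
    hidden = relu(matvec(w_in, x))
    mixed = [h + x_j for h, x_j in zip(hidden, x)]  # mixing layer: element-wise sum
    return softmax(matvec(w_o, mixed))


x = [1.0, 2.0]
w_o = [[1.0, 0.0], [0.0, 1.0]]          # illustrative output weights
w_in_off = [[-1.0, 0.0], [0.0, -1.0]]   # negative pre-activations, so the hidden output is zero
w_in_on = [[1.0, 0.0], [0.0, 0.0]]      # the first hidden unit adds a residual of 1.0

linear_only = forward(w_in_off, w_o, x)
with_residual = forward(w_in_on, w_o, x)
```

With `w_in_off` the hidden layer contributes nothing and the prediction equals that of a linear softmax classifier applied to `x`; with `w_in_on` the residual shifts the mixed vector to `[2.0, 2.0]` and the two classes become equally likely.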


2.3 Growing and Pruning 


While learning from data streams, the data volume continuously grows without
an upper bound. Therefore, the learning process must be fast and memory
efficient. Also, because the learning is incremental, the actual complexity of the
decision boundary is unknown. A network with too few trainable parameters
will not be able to capture the underlying function from which data is being
generated. On the other hand, a network with too many trainable parameters
will overfit [6]. While learning from trapezoidal data streams the task is more
challenging, because the feature space and the data volume grow simultaneously.
As the feature space grows, interactions of existing and new features generate
new higher order features that can be useful for classification. Therefore, the
interactions of new and existing features need to be considered. Also, the new
features introduced in the data stream can be irrelevant or redundant. If the
network keeps growing without any feature selection mechanism, it will most
likely overfit and have a poor generalization ability. Moreover, the running time
complexity of the model will be high.

To address these issues, we introduce growing and pruning mechanisms to
the proposed model. When a training instance $x_t$ with a higher dimension $d_t$
arrives, the network allocates $d_t - d_t^i$ input and hidden layer nodes and randomly
initializes $(d_t - d_t^i) \cdot (d_t^i + d_t^h)$ weights in a fully connected manner. Allocation
of the new input nodes ensures that the network can use the new features and
consider their combinations with the existing features. Moreover, the allocation
of the new hidden layer nodes increases the learning capacity of the model.
Therefore, it helps the network to adjust itself to the increasing classification
complexity caused by the growing feature space of trapezoidal data streams.

The network is trained using Kullback-Leibler divergence loss with weight
penalty, shown in Eq. 2. This forces the network to learn small weights for less
important connections. After each growing and weight update step, the network
is pruned to keep only the largest $O(d_t)$ connections. The number of connections
to be preserved is calculated by $\phi d_t$, where $0 < \phi \le 1$ is a parameter that
controls the pruning strength. This aggressive pruning strategy ensures that the
number of trainable parameters in the network is linearly bounded by the size
of the input feature space.
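
A rough sketch of the two mechanisms on a single weight matrix is given below. It rests on several assumptions that the text does not spell out: the hidden layer is kept as wide as the input layer, new weights are drawn from a small zero-mean Gaussian, "largest" is read as largest absolute value, and only $W_{in}$ is pruned. The function names `grow` and `prune` are hypothetical.

```python
import random


def grow(w_in, new_dim, rng, scale=0.01):
    """Pad a square W_in (hidden x input) to new_dim x new_dim with small random weights."""
    old_dim = len(w_in)
    grown = []
    for i in range(new_dim):
        row = []
        for j in range(new_dim):
            if i < old_dim and j < old_dim:
                row.append(w_in[i][j])             # keep the learned weight
            else:
                row.append(rng.gauss(0.0, scale))  # newly allocated connection
        grown.append(row)
    return grown


def prune(w, keep):
    """Zero all but the `keep` largest-magnitude entries of the matrix w."""
    entries = [(abs(a), i, j) for i, row in enumerate(w) for j, a in enumerate(row)]
    entries.sort(reverse=True)
    kept = {(i, j) for _, i, j in entries[:max(keep, 0)]}
    return [[a if (i, j) in kept else 0.0 for j, a in enumerate(row)]
            for i, row in enumerate(w)]


rng = random.Random(0)
w_in = [[0.5, -0.2], [0.05, 0.9]]      # weights learned on two features
w_in = grow(w_in, new_dim=3, rng=rng)  # an instance with a third feature arrives
phi, d_t = 1.0, 3
w_in = prune(w_in, keep=int(phi * d_t))
nonzero = sum(1 for row in w_in for a in row if a != 0.0)
```

After pruning, only three connections remain for three input features, which illustrates how the number of active parameters stays proportional to $d_t$ even though the dense matrix has $d_t^2$ entries.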


3 Experiments and Results 


We empirically evaluate the performance of the proposed method SLFN-S on 
trapezoidal data streams. We first compare the accuracy of SLFN-S with a sin- 
gle hidden layer neural network without shortcut connections, which will be 
referred to as SLFN, to show that the shortcut connections help to simplify the 
decision boundary and improve the convergence. Note that SLFN has the same 
growing and pruning capabilities as SLFN-S for the sake of fairness. Then, we 
compare the accuracy of SLFN-S with the state-of-the-art learning algorithm 
for trapezoidal data streams, OLSF [16]. We use 8 UCI datasets from [16] and 
simulate trapezoidal data streams by splitting each dataset into 10 chunks such 
that the number of features included by each chunk increases. For example, the 
instances in the first chunk have the first 10% of features, instances in the sec- 
ond chunk have the first 20% of features and so on. The numbers of instances, 
features and the parameter setting used for each dataset are listed in Table 1. 
We use 20-fold cross validation on random permutations of the datasets and 
measure the average error rate. Parameters are chosen with cross validation. We 
use ADAM [7] to update the network weights. 
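
As an illustration of the stream simulation described above, the following sketch splits a data matrix into 10 chunks with a growing number of visible features. The function name `trapezoidal_chunks` and the rounding rule (ceiling, at least one feature) are assumptions made for this example; the paper does not state how fractional feature counts are handled.

```python
import numpy as np

def trapezoidal_chunks(X, n_chunks=10):
    """Split rows into consecutive chunks; chunk k exposes the first k/n_chunks of the features."""
    n_samples, n_features = X.shape
    chunks = []
    for k, rows in enumerate(np.array_split(np.arange(n_samples), n_chunks), start=1):
        # assumption: round the visible feature count up and keep at least one feature
        n_visible = max(1, int(np.ceil(n_features * k / n_chunks)))
        chunks.append(X[rows, :n_visible])
    return chunks

X = np.arange(200 * 20, dtype=float).reshape(200, 20)   # toy data: 200 instances, 20 features
chunks = trapezoidal_chunks(X)
shapes = [c.shape for c in chunks]
```

With 20 features, the first chunk carries 2 features and the last chunk carries all 20, which matches the 10%, 20%, ... schedule in the text.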

Figure 2 shows the mean number of incorrect predictions made by SLFN-S, 
SLFN and OLSF over 20 folds for each dataset, with standard error. Several 
observations can be drawn. First, SLFN-S has lower error rates than SLFN in all 
8 UCI datasets, because SLFN needs more iterations over the data to converge. 
This is because SLFN-S tends to learn simpler decision boundaries and use 
nonlinear mappings only when needed. Second, SLFN-S significantly outperforms 

Table 1. Number of samples, features and parameters used for each UCI dataset. 

Dataset      #Samples   #Features   Parameters 
wbc          699        10          0.75   1     102 
wpbc         198        34          0.85   0.8   0.1 
wdbc         569        31          0.05   1     0.01 
german       1,000      24          0.05   1     0.1 
ionosphere   351        35          0.15   1     0.05 
svmguide3    1,234      21          0.1    1     0.1 
magic04      19,020     10          0.02   1     0.1 
a8a          32,561     123         0.05   0.5   0.1 


[Figure 2: bar charts of the error percentage per dataset; legible panel titles: WBC, Svmguide3, WDBC, Ionosphere] 

Fig. 2. Mean number of incorrect predictions for SLFN, SLFN-S and OLSF algorithms. 


SLFN in 6 of the 8 UCI datasets. For WDBC and WPBC, SLFN-S has fewer 
errors than SLFN but the difference is not significant. WPBC is nonlinearly 
separable, which can also be observed from the performance difference between 
the linear OLSF and the nonlinear neural network based models. Therefore, the 
simplification of the decision boundary made by SLFN-S did not significantly 
improve on the convergence of SLFN. On the other hand, WDBC is highly linearly 
separable and therefore does not require learning of nonlinear mappings. This can 
also be observed in Fig. 3, where SLFN-S and OLSF show similar trends. As a result, 
both SLFN and SLFN-S achieve similar accuracy. These results verify that the 
shortcut connections improve the convergence by allowing the network to learn 
simpler decision boundaries. Figure 3 shows the changing error rate of OLSF 
and SLFN-S with respect to the instances received. We observe that SLFN-S 
converges faster and has higher accuracy than the OLSF algorithm in all of the 
8 UCI datasets. This is because OLSF considers input features independently 
and does not take feature combinations into account. In contrast, SLFN-S 
explores new useful feature combinations and prunes the weights that do not 
contribute significantly. Moreover, SLFN-S can learn nonlinear decision 
boundaries when needed. The experimental results show that our proposed 
method SLFN-S significantly outperforms the state-of-the-art OLSF algorithm. 
The running time of both the SLFN-S and OLSF algorithms scales linearly with 
the number of input features. 

Fig. 3. Change of error rates of SLFN-S and OLSF algorithms in trapezoidal data 
streams. 


4 Conclusion 


This paper proposed a Single Hidden Layer Neural Network with Shortcut Connections 
(SLFN-S) and showed that the proposed method significantly outperforms 
the state-of-the-art OLSF. SLFN-S provides a growing mechanism to adapt 
itself to the increasing complexity of trapezoidal data streams. Moreover, the 
proposed model uses a pruning mechanism to ensure that the complexity of the 
network scales linearly with the size of the input feature space. We 
compared the performance of the proposed model with SLFN, which has no shortcut 
connections, and with the state-of-the-art learning algorithm for trapezoidal data 
streams, OLSF [16]. We showed that the shortcut connections help the network 
to learn simpler decision boundaries and converge faster. We also showed that 
the proposed method significantly outperforms OLSF. Note that unlike OLSF, 
SLFN-S does not have an active feature selection mechanism. OLSF uses one 
weight for each input feature. At each iteration, it projects its weights and truncates 
a portion of the smallest weights. Therefore, it stops using the features 
that are associated with the truncated weights. On the other hand, SLFN-S uses 
multiple weights per feature. A feature is removed only if all weights associated 
with that feature are pruned. Therefore, OLSF is capable of learning sparser 
solutions than SLFN-S. 

Future work includes conducting extensive experiments on larger datasets. 
Another future direction is to add a feature selection policy, and trainable weights 
for the shortcut connections to learn the ratios of mixtures for linear and 
nonlinear mappings. 


Improving Active Learning by Avoiding Ambiguous Samples 

Abstract. If label information in a classification task is expensive, it can 
be beneficial to use active learning to select the most informative samples 
for labeling by a human. However, there can be samples that are meaningless 
to the human or recorded wrongly. If these samples are near the 
classifier’s decision boundary, they are queried repeatedly for labeling. 
This is inefficient for training because the human cannot label these 
samples correctly, and it may lower human acceptance. We introduce 
an approach that compensates for the problem of ambiguous samples by excluding 
clustered samples from labeling. We compare this approach to other 
state-of-the-art methods. We further show that we can improve the accuracy 
in active learning and reduce the number of ambiguous samples 
queried during training. 


Keywords: Active learning · Ambiguous samples · Certainty · Rejection · Clustering 


1 Motivation 


User-adaptable learning systems that are post-trained by the user have the 
advantage that they can adjust to new circumstances or improve towards a 
user-specific environment. In a classification system the samples can be trained 
incrementally and labeled by the user. Active learning [10] is an efficient training 
technique in which the samples predicted to deliver the highest improvement 
for the classifier are chosen for labeling by a human. 

Whenever the user is involved, the system has to make sure that interaction 
and training are efficient. A user often feels bored with labeling tasks; therefore the 
learning system should limit the number of queries, and the queries should be solvable 
for the human, so that he is not annoyed and instead feels comfortable and 
meaningful in his role as interaction partner. To know when the learning 
system needs advice, it is necessary to predict the competence of the learning 
system, which we demonstrated in our recent contribution [6] with respect to a 
classifier’s accuracy in pool-based incremental active learning. On the 
other hand, the human teacher can also have limited competence in fulfilling his task 
in the oracle role. 

In most active learning approaches the oracle is expected to have perfect 
domain knowledge [11]. But in many real-world applications a perfect oracle is 
not realistic because there can be samples resulting from noisy recordings, caused 
for example by a dirty camera or bad light conditions. Also, a specific oracle might 
not know the labels for specific samples because it cannot identify them. 

Our goal in this contribution is that the learning system adapts to the 
human's weaknesses and adjusts its interaction strategy to be a good cooperation 
partner. For active learning this means that, rather than forcing the human 
to give uncertain answers, we want to give him the opportunity to reject the 
samples he is uncertain about. 

There are diverse approaches in the literature for handling uncertainty in 
labeling. Much research has been done on active learning with noisy labels or with 
labels from multiple oracles [15]. However, in our task setting the robot is intended 
to have access to only one oracle. Käding et al. [5] proposed an approach for their 
Expected Model Output Change (EMOC) model that adds uncertain samples 
to one error class. However, their method only works with EMOC and is directly 
integrated into the classifier. A similar approach was taken by Fang et al. [3]. 
They train a classifier that should distinguish certain and uncertain objects. 
However, in their evaluation they clustered the data into three clusters and 
defined two of them as ambiguous, which is too simplistic and does not model 
a real-world task. The problem with classifier-based solutions for finding and 
rejecting ambiguous samples is that, according to our experiments, they cannot 
generalize well in highly complex scenarios like the one we are facing. In our 
application scenario, a service robot acts in a garden environment [7], mows 
the lawn and records the garden and occurring objects with a camera. However, 
because the occurring objects are diverse, there is no clear separation between 
recognizable and ambiguous samples in the feature space, making it hard to train, 
e.g., a secondary classifier to separate them, as is shown in the experiment section. 

We show that a more local method is better able to adapt to these distributed 
ambiguous samples, and therefore we introduce Density-Based Querying Exclusion 
(DBQE), a lightweight clustering-based approach which finds ambiguous 
clusters and excludes them from querying in active learning. Our approach does 
not inhibit exploration of unknown classes and can be stacked on top of any existing 
active learning model and any querying technique. We evaluate it using a 
challenging outdoor data set (Fig. 1). 


2 Active Learning 


In pool-based active learning there is a labeled set L and an unlabeled set U. 
The active training of a classifier C starts with an empty or small L. The learner 
C can choose which samples from U should be labeled by a so-called oracle 
(which is often a human) and added to L. This is called querying, and there 
are a variety of approaches to find the best samples to query [11]. An often 
used querying technique is uncertainty sampling [1], which queries the samples 
with the least certainty for labeling. Other strategies select samples based on 
the expected model output change [5], or they consider a committee of different 
classifiers [12] for choosing the samples to be queried. C is then trained in an 
incremental fashion or again from scratch on L. 
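
To make the querying step concrete, here is a minimal sketch of uncertainty sampling in its least-confidence form. It assumes the classifier exposes class probabilities for the unlabeled pool; the function name `query_least_certain` and the toy probabilities are hypothetical, and the certainty measure used with GLVQ in this paper may differ.

```python
import numpy as np

def query_least_certain(probabilities):
    """Return the index of the pool sample whose most likely class has the lowest probability."""
    certainty = probabilities.max(axis=1)
    return int(np.argmin(certainty))

# toy class probabilities for three unlabeled samples and two classes
proba = np.array([[0.90, 0.10],
                  [0.55, 0.45],
                  [0.20, 0.80]])
idx = query_least_certain(proba)
```

Here the second sample is selected, because its top class probability is closest to chance.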


Fig. 1. Images from the outdoor object recognition benchmark [7,8]: The upper row 
images are labeled as recognizable and the bottom row as ambiguous. Objects like the 
basketball or the leaves are recognizable from every angle. The car is recorded in its 
canonical view, opposed to the blue duck which is ambiguous from this perspective. 
There are also views of different objects which are hardly distinguishable, like an apple 
(bottom center) and a tomato (bottom right). 


3 Density-Based Querying Exclusion 


We introduce Density-Based Querying Exclusion (DBQE), which clusters 
ambiguous samples and prevents them from being queried by excluding them from 
U. Our assumption is that ambiguous samples are located in clusters which 
can occur in a variety of places in the feature space. Density-based clustering 
approaches have been shown to be versatile and to deliver good performance while 
being robust to outliers [2]. Another advantage is that the 
number of clusters does not have to be known in advance. This is important in 
particular because in our case we want to find only one cluster at a time, while 
there can be any number of clusters in the data set. 

The training procedure of an active learning classifier using DBQE is illustrated 
in Algorithm 1. 

Algorithm 1. Active learning with Density-Based Querying Exclusion (DBQE) 

Require: maxPts    ▷ do clustering on the maxPts points nearest to x_e 
Require: minPts    ▷ minimum number of neighbors to be a core sample 
Require: ε         ▷ distance range describing a sample’s neighborhood 

U ← load_data()                           ▷ unlabeled data 
L ← {}                                    ▷ labeled set is empty 
C ← initialize_classifier()               ▷ initialize active classifier 
while not C.is_trained() do 
    s ← C.query_next_sample(U)            ▷ querying using uncertainty sampling 
    l ← ask_for_label(s)                  ▷ ask oracle for supervision 
    if l.is_ambiguous() then              ▷ oracle labeled s as ambiguous 
        c ← DBQE(s, minPts, maxPts, ε)    ▷ DBQE clustering is applied 
        U ← U \ c                         ▷ found cluster c is excluded from U 
    else                                  ▷ s is not ambiguous and oracle labeled it 
        C.train(s, l)                     ▷ classifier C is trained with new sample s and label l 
    end if 
end while 

function DBQE(x_e, minPts, maxPts, ε) 
    v ← {}                                ▷ visited samples 
    c ← {x_e}                             ▷ samples considered to be in cluster 
    t ← {x_e}                             ▷ samples to be processed 
    R ← get_samples_nearby(U, x_e, maxPts)    ▷ get maxPts nearest samples to x_e 
    for a ∈ t do 
        if not a ∈ v then                 ▷ if a was not visited before 
            v ← v ∪ {a}                   ▷ mark a as visited 
            n ← region_query(a, ε)        ▷ find neighborhood points 
            if n.size() > minPts then     ▷ if a is a core sample 
                c ← c ∪ {a}               ▷ add a to cluster set c 
                t ← t ∪ n                 ▷ add n to t 
            end if 
        end if 
        t ← t \ {a}                       ▷ remove a from queue t 
    end for 
    return c                              ▷ return ambiguous cluster c 
end function 

function region_query(s, ε)               ▷ returns samples from R within range ε of s 
    n ← {} 
    for i ∈ R do 
        if |i − s| < ε then               ▷ sample i is within ε range 
            n ← n ∪ {i}                   ▷ i is added to set n 
        end if 
    end for 
    return n                              ▷ samples in neighborhood are returned 
end function 

The active learning is applied as usual: first the query strategy selects a 
sample and the oracle is asked for a label. If the oracle can provide it, the classifier is 
trained; otherwise our DBQE approach is applied, which performs a region growing 
to find the cluster containing the queried ambiguous sample, which we call x_e. 
In the clustering function we select a subset of samples R ⊂ U which are the 
nearest samples to x_e, both for speed and to limit the maximum number 
of excluded samples; maxPts denotes the number of points in R. The region 
growing is applied similarly to DBSCAN [2] and is also illustrated in Fig. 2. DBSCAN 
iteratively applies this region growing until the whole data set is clustered. There 
are two parameters involved: ε is a distance range describing an arbitrary sample’s 
neighborhood. The other parameter to choose is minPts, which is the 
minimum number of samples in a sample’s neighborhood for the sample to be a 
so-called core sample; otherwise it is an outlier. The main idea is to expand a 
cluster c around the ambiguous sample x_e. The cluster samples in c are excluded 
from U. 

If there is no cluster containing x_e (so x_e itself is an outlier), DBQE 
excludes only x_e from U. 
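
The region growing of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative version only: it assumes Euclidean distance and a NumPy feature matrix, and the function name `dbqe`, its argument names and the toy data are hypothetical rather than taken from the ALeFra code.

```python
import numpy as np

def dbqe(X_unlabeled, e_idx, min_pts, max_pts, eps):
    """Return the indices (into X_unlabeled) of the cluster grown around sample e_idx."""
    # R: the max_pts samples nearest to the ambiguous sample (normally including itself)
    dist_to_e = np.linalg.norm(X_unlabeled - X_unlabeled[e_idx], axis=1)
    R = np.argsort(dist_to_e)[:max_pts]

    visited = set()
    cluster = {e_idx}          # the ambiguous sample is always excluded
    queue = [e_idx]
    while queue:
        a = queue.pop()
        if a in visited:
            continue
        visited.add(a)
        # region query: samples of R within distance eps of a
        d = np.linalg.norm(X_unlabeled[R] - X_unlabeled[a], axis=1)
        neighbors = R[d < eps]
        if neighbors.size > min_pts:       # a is a core sample
            cluster.add(a)
            queue.extend(int(i) for i in neighbors if int(i) not in visited)
    return sorted(cluster)

# toy data: a dense blob near the origin and three distant points
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
              [5.0, 5.0], [5.1, 5.0], [9.0, 9.0]])
blob = dbqe(X, e_idx=0, min_pts=3, max_pts=6, eps=0.2)
lone = dbqe(X, e_idx=7, min_pts=3, max_pts=6, eps=0.2)
```

In the toy data the dense blob is returned as one cluster, while the isolated point yields a cluster containing only itself, mirroring the outlier case described above.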


Fig. 2. Illustration of DBQE: the points represent samples from the unlabeled subset 
R ⊂ U with the number of samples maxPts = 14. Blue points (circles) are samples not 
yet visited, visited points in v are displayed in orange (half circles), points determined as 
part of the ambiguous cluster c are in red (peaked circles), and outliers are in gray (pacman 
shape). The progress of the region growing is displayed with the minimum neighborhood 
size minPts = 3. The oracle defines x_e as ambiguous and in the first step x_e is 
determined to be a core sample. The cluster is expanded, finding the second core sample 
in step 2. In step 3, an outlier is found, which is not included in the cluster. The final 
clustering result is displayed on the right. (Color figure online) 


4 Evaluation 


We evaluated our method together with some baseline methods on our outdoor 
data set [7] because it provides a real application benchmark of high difficulty 
[6-8]. The data set is an image data set consisting of 50 object classes. The 
objects are lying on the lawn and were recorded by a mobile robot in such a way 
that the robot approaches the object and takes ten consecutive pictures per 
approach. In total each object has ten approaches with ten images each, summing 
up to a total of 5000 images. Some objects can be hard to distinguish due to an 
unfavorable viewing angle. Also, there are some objects that look rather similar, 
like an apple, onion, tomato, orange and ball, or several rubber ducks. A 
feature representation of each image is extracted with the VGG16 deep convolutional 
net [13] trained on images from the ImageNet competition. We removed 
the last softmax layer and use the outputs of the penultimate layer as a 
4096-dimensional feature vector. There can be approaches or partial approaches of 
an object in which the object images are ambiguous for a human. We 
annotated this ambiguity property for our data set. In total 
we annotated 24% of the images as ambiguous; a selection of recognizable and 
ambiguous images can be seen in Fig. 1. For evaluation a 50/50 train-test split 
was done. The data was split by approaches, so that the images of a single 
approach are either completely in the train or in the test set. We repeated the 
experiment 15 times to average our results. As a classifier we chose Generalized 
Learning Vector Quantization (GLVQ). GLVQ has proved to be an accurate 
classifier in incremental learning [8] and is also suitable for active learning with 
uncertainty sampling [6]. 
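
The split by approaches can be sketched as a group-wise split, so that all images sharing an approach identifier end up on the same side. The function name `split_by_group`, the identifier layout and the random seed are assumptions for illustration; whether the original split was additionally balanced per object is not stated in the text.

```python
import numpy as np

def split_by_group(groups, test_fraction=0.5, seed=0):
    """Split sample indices so that every group lies entirely in train or in test."""
    rng = np.random.default_rng(seed)
    unique = np.unique(groups)
    rng.shuffle(unique)
    n_test = int(round(len(unique) * test_fraction))
    test_mask = np.isin(groups, unique[:n_test])
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# hypothetical layout: 12 approaches with 10 consecutive images each
approach_id = np.repeat(np.arange(12), 10)
train_idx, test_idx = split_by_group(approach_id)
```

Splitting at the approach level avoids near-duplicate consecutive frames of one approach appearing in both the training and the test set.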

DBQE needs the parameters minPts and ε to be set to suitable values. To 
get a better idea of how the data is clustered, a look at unsupervised statistics 
of the distances to neighboring samples can help. We achieved good 
results with many parameter combinations, but we also applied a grid search 
in which we defined ranges of minPts and ε values and tested all combinations of 
those. We found that ε = 35, minPts = 3 and maxPts = 20 give the best 
results for our evaluation on the outdoor data set. For training and evaluating 
an active learning classifier we developed the framework ALeFra¹ in the context of 
this paper. With it, any offline or incremental classifier can be converted to 
an active classifier. Basic querying techniques are also implemented, and 
the user can visualize the progress of the training with a few lines of code. There 
is a visualization of the feature space which uses a dimensionality reduction like 
t-SNE [9] or MDS [14], and if the data consists of images, they are visualized in 
a collage which is created after each batch while training. 
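
One possible form of the unsupervised distance statistics mentioned above is the distribution of each sample's distance to its k-th nearest neighbor, which gives a rough scale for ε at a given minPts. The sketch below is an assumption about how such statistics could be computed, not the procedure used by the authors; `kth_neighbor_distances` and the random data are hypothetical.

```python
import numpy as np

def kth_neighbor_distances(X, k):
    """Distance from every sample to its k-th nearest neighbor (k = 0 is the sample itself)."""
    # full pairwise Euclidean distances; O(n^2) memory, so only suitable for small data
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    return np.sort(dist, axis=1)[:, k]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))          # stand-in for image feature vectors
d3 = kth_neighbor_distances(X, k=3)
quartiles = np.percentile(d3, [25, 50, 75])
```

Candidate ε values for a grid search can then be chosen around these quartiles instead of being guessed blindly.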

We investigate three approaches and compare them to simple baselines: 


– Classifier: The problem can be represented as a binary classification task, 
predicting whether samples are recognizable or not [3]. The classifier is trained 
with all recognizable and ambiguous samples queried so far. We evaluated the 
classifiers GLVQ, kNN, logistic regression and SVM, of which kNN outperformed 
the others. This may be because a local model like kNN can better 
adapt to the ambiguous samples, which may be diverse in feature space. We have 
also observed that using the classifier’s confidence information for the predicted 
samples can improve performance and the exploration of new classes in U. Therefore 
we make use of a certainty value of the kNN classifier, which uses distance 
information of the winning and losing classes as defined in [6]. Only samples 
that are classified as ambiguous with a certainty value greater than a predefined 
threshold are avoided in querying. We tuned this threshold to the best 
performance for our evaluation on the outdoor data set. 

¹ https://github.com/limchr/ALeFra 

– Rejection: The problem can be represented as a rejection task, where some 
samples are rejected from querying. Therefore we implemented a local rejection 
approach [4] for the GLVQ classifier. Here every prototype has a rejection 
threshold which is set to zero at the beginning. If an ambiguous sample is queried, 
the winning prototype’s threshold is adjusted to d · a, where d is the certainty 
of the ambiguous sample and a is a parameter that can be tuned. Only those 
samples are considered for querying for which the distance d to their winning 
prototype is higher than the threshold of that particular prototype. 

– Clustering: The problem can be represented as a clustering task. DBQE 
uses density-based clustering to represent ambiguous samples. We also tried 
to apply silhouette analysis, but density-based clustering results in higher 
accuracy in finding the ambiguous clusters; additionally, it is very fast at 
expanding a cluster and it can also detect outliers. 


Fig. 3. Evaluation on the outdoor data set: test accuracies (y-axis) of all approaches 
vs. number of queried samples (x-axis). 


We also implemented the following two baseline strategies for comparison: 


– Mark: If an ambiguous sample is queried, it is marked as ambiguous and is 
not considered in future queries. This baseline strategy can be seen as a 
naive approach for handling ambiguous samples. 

– Prediction: If an ambiguous sample is queried, the classifier predicts its 
label and uses this for training. With this baseline we want to determine if 
the classifier itself is able to classify the samples that the human rejected as 
ambiguous. 


Improving Active Learning by Avoiding Ambiguous Samples 525 


Figure 3 shows the test accuracies of the strategies for active training. DBQE 
and classifier are the two strategies with the highest accuracy, with DBQE being 
better in the middle stage of the training. Reject is slightly better than mark, 
and at the end of training both converge to DBQE and classifier. Prediction 
is significantly worse than the other approaches, indicating that the classifier 
is not accurate at predicting those labels that the human cannot provide. 

Fig. 4. Number of queried ambiguous samples during training. Each bin of the 
histograms represents the number of ambiguous samples queried, pooled in bins of 16 
queries, giving a total of 50 bins. The number of ambiguous samples is displayed 
on the y-axis and the number of queries on the x-axis. Please note that the baseline 
strategy prediction is not represented here because it is using ambiguous samples for 
training. 
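
The binning described in the caption, and a running total of ambiguous queries, can be computed as in the following sketch. It assumes a 0/1 flag per query in query order; the function name `ambiguous_per_bin` and the toy flags are hypothetical.

```python
import numpy as np

def ambiguous_per_bin(is_ambiguous, bin_size=16):
    """Count ambiguous queries in consecutive bins of bin_size queries."""
    flags = np.asarray(is_ambiguous, dtype=int)
    n_bins = len(flags) // bin_size
    return flags[: n_bins * bin_size].reshape(n_bins, bin_size).sum(axis=1)

# toy query log: 800 queries (50 bins of 16), five of them ambiguous
flags = np.zeros(800, dtype=int)
flags[[0, 3, 15, 16, 799]] = 1
per_bin = ambiguous_per_bin(flags)
cumulative = np.cumsum(per_bin)
```

The per-bin counts correspond to a histogram over the training progress, and the cumulative sum gives the total number of ambiguous samples queried up to each bin.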


DBQE is slightly better than classifier in terms of accuracy during training. 
However, another important objective was to minimize human frustration and 
to make the human feel comfortable in his role. Therefore we visualized the number 
of ambiguous samples queried during training. In Fig. 4 it can be seen that 
significantly fewer ambiguous samples are queried using DBQE. After 400 trained 
samples, ambiguous samples are queried only occasionally. The querying of ambiguous 
samples using classifier drops only slowly and, especially in the earlier stage of 
training, is significantly higher than with DBQE. Mark queries more ambiguous 
samples than DBQE and classifier. To better visualize the total number 
of ambiguous samples queried, we plotted their cumulative sum 
in Fig. 5. DBQE queries approximately three 
times fewer ambiguous samples than classifier and five times fewer than reject and 
mark. 

Fig. 5. Cumulative sum of queried ambiguous samples during training. 


5 Conclusion 


We showed that it is possible to efficiently exclude ambiguous samples from 
active learning. In our challenging outdoor object recognition setting, where 
ambiguous samples were distributed over the whole feature space, DBQE is able 
to improve the accuracy in active learning and further reduces the amount of 
meaningless queries significantly. We implemented and evaluated a variety of 
other approaches in depth and compared them to DBQE in a realistic setting. 

We think that DBQE can be used to model human capabilities and significantly 
improve robot acceptance as a cooperation partner. To verify this, as a 
next step we want to integrate DBQE into a robotic application and investigate a 
larger number of benchmarks. 


Abstract. We consider the task of predicting the solar power output for the 
next day from previous solar power data. We propose EN-meta, a meta-learning 
ensemble of neural networks where the meta-learners are trained to predict the 
errors of the ensemble members for the new day, and these errors are used to 
dynamically weight the contribution of the ensemble members in the final 
prediction. We evaluate the performance of EN-meta on Australian solar data for 
two years and compare its accuracy with state-of-the-art single models, classical 
ensemble methods and EN-meta versions without the meta-learning component. 
The results showed that EN-meta was the most accurate method and thus 
highlight the potential benefit of using meta-learning for solar power 
forecasting. 


Keywords: Solar power · Dynamic ensembles · Neural networks · Meta-learning 


1 Introduction 


Solar energy is a clean and renewable source of electricity. Its use is rapidly growing 
due to the improved efficiency and reliability of PhotoVoltaic (PV) solar panels and 
their reduced cost. However, the generated solar power is highly variable as it depends 
on the solar irradiance and other meteorological factors, which makes its large-scale 
integration in the power grid more difficult. This motivates the need for accurate 
prediction of the produced solar power, in order to ensure reliable electricity supply. 

In this paper we consider the task of predicting the PV power output for the next 
day at half-hourly intervals using only previous PV data. The other commonly used 
data source is weather information, however reliable weather measurements and 
forecasts are not always available for the PV site. Recent studies [1, 2] investigating the 
use of previous PV data only have shown promising results and in this paper we also 
consider univariate prediction. Specifically, given a time series of PV power outputs up 
to the day d: \([P^1, \ldots, P^d]\), where \(P^i\) is a vector of half-hourly power outputs for day i, 
our goal is to forecast \(P^{d+1}\), the half-hourly power output for day d + 1. 

Different approaches for PV power forecasting have been proposed, e.g. using 
statistical methods such as linear regression and autoregressive moving average [1], or 
machine learning methods such as Neural Networks (NN) [1, 3, 4], Support Vector 
Regression [5] and k-Nearest Neighbor (KNN) based methods [1, 3, 6]. Ensembles 
combining the predictions of several models have also been investigated for solar 
power and other time series forecasting tasks and shown to be very competitive [7-9]. 

In [8] we developed an ensemble of NNs for PV power forecasting that tracks the 
error of the ensemble members on previous data and uses this error to determine the 
weights of the ensemble members for the prediction for the new day. In this paper, 
motivated by [9], we investigate a different approach that uses meta-learning to predict 
the error of the ensemble members for the new day and calculate their weights based on 
their predicted error, rather than on their errors on previous days. Thus, the idea is to 
adapt the ensemble to the characteristics of the new day by selecting and combining the 
most appropriate ensemble members, the ones with the most suitable expertise, esti- 
mated based on their predicted error. In summary, the contributions of this paper are: 


1. We propose EN-meta, a new dynamic ensemble combining NNs. It uses meta- 
learners to predict the error of each ensemble member for the new example, and 
based on it to determine the contribution of the ensemble member in the final 
prediction. 

2. We investigate four strategies for determining the weights of the ensemble members 
based on their predicted errors and consider two different types of meta-learners. 

3. We conduct an evaluation using Australian PV data for two years and compare the 
performance of EN-meta with a single NN, SVR, KNN and persistence baseline, 
classic ensembles (bagging, boosting and random forest) and two EN-meta versions 
without the meta-learning component. The results demonstrate the effectiveness of 
EN-meta and the potential of meta-learning methods for solar power forecasting. 


2 Data and Experimental Setup 


Data. We used PV power data for two years, from 1 January 2015 to 31 December 
2016, for 10 h during the daylight period: from 7 am to 5 pm. The data comes from a 
rooftop PV plant located at the University of Queensland in Brisbane, Australia, and is 
available from http://www.uq.edu.au/solarenergy/. 

The original PV power data was recorded at 1-min intervals. As our task is to 
predict the PV power at 30-min intervals, the raw 1-min data was aggregated to 30-min 
data by averaging the values in the 30-min intervals. The data was also normalized to 
[0, 1]. The small number of missing values (0.02%) were replaced before the 
aggregation using a nearest neighbor method as in [8]. Hence, our dataset contains 14,620 
values in total (= (365 + 366) days × 20 values). 
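
The aggregation and normalization steps can be sketched as follows. The text only states that the data was normalized to [0, 1]; min-max scaling is an assumption here, and the function name `aggregate_and_normalize` and the synthetic readings are hypothetical.

```python
import numpy as np

def aggregate_and_normalize(minute_values, window=30):
    """Average consecutive windows of 1-min readings, then min-max scale to [0, 1]."""
    values = np.asarray(minute_values, dtype=float)
    n_windows = len(values) // window
    averaged = values[: n_windows * window].reshape(n_windows, window).mean(axis=1)
    lo, hi = averaged.min(), averaged.max()
    # assumption: min-max scaling; a constant series maps to zeros
    return (averaged - lo) / (hi - lo) if hi > lo else np.zeros_like(averaged)

one_day = np.linspace(0.0, 599.0, 600)   # 600 one-minute readings, i.e. 10 h of daylight
half_hourly = aggregate_and_normalize(one_day)
```

Ten hours of 1-min readings give 600 values per day, which the 30-min averaging reduces to the 20 values per day used in the rest of the paper.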


Data Sets. The PV data was split into three subsets: (1) training - 70% of the 2015 
data, used for model training; (2) validation - the remaining 30% of the 2015 data, used 
for parameter selection and (3) testing - the 2016 data, used to evaluate the accuracy. 


Evaluation Measures. We used two performance measures: Mean Absolute Error 
(MAE) and Root Mean Squared Error (RMSE): 

\[ \mathrm{MAE} = \frac{1}{N \cdot n} \sum_{i=1}^{N} \left| P^i - \hat{P}^i \right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N \cdot n} \sum_{i=1}^{N} \left( P^i - \hat{P}^i \right)^2} \] 

where \(P^i\) and \(\hat{P}^i\) are the vectors of actual and predicted half-hourly PV power outputs 
for day i, N is the number of days in the testing set and n is the number of predicted 
power outputs for a day (n = 20). 
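
As a small worked example of the two measures, the sketch below computes them over an array of shape (N, n), averaging over all N · n values. The toy arrays (N = 2, n = 2) and the function names are hypothetical.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error over all days and all half-hourly values."""
    return np.abs(actual - predicted).mean()

def rmse(actual, predicted):
    """Root mean squared error over all days and all half-hourly values."""
    return np.sqrt(((actual - predicted) ** 2).mean())

actual = np.array([[0.2, 0.4], [0.6, 0.8]])      # two days, two values per day
predicted = np.array([[0.1, 0.4], [0.6, 1.0]])
mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
```

RMSE penalizes the single large deviation (0.2) more strongly than MAE, which is why both measures are usually reported together.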


3 Dynamic Meta-Learning Ensemble 


There are three main steps in creating the dynamic meta-learning ensemble EN-meta as 
shown in Fig. 1: (i) training ensemble members, (ii) training meta-learners and 
(iii) calculating the weights of the ensemble members for the prediction of the new 
example. 

We train the ensemble members to predict the PV power for the next day and their 
corresponding meta-learners (one for each ensemble member) to predict the error of 
this prediction. Thus, each meta-learner learns to predict how accurate the prediction of 
the ensemble member will be for the new day based on the characteristics of the day. 
The predicted errors are converted into weights (higher weights for the more accurate 
ensemble members and lower for the less accurate) and the final prediction is given by 
the weighted average of the individual predictions. 

Fig. 1. Structure of EN-meta 


3.1 Training Ensemble Members 


Figure 2 illustrates the training of the ensemble members. The ensemble consists of 
S NNs. Effective ensembles include diverse ensemble members [10]. We generate 
diversity using two strategies - random example sampling and random feature sampling, 
using the method from [8], which was shown to perform well. 

Fig. 2. Training ensemble members 


Random Example Sampling: We create S bootstrap samples, one for each NN, using 
random sampling with replacement and a pre-defined example sampling rate Rs. Each 
sample contains only Rs% of the d examples for the first year, which is the whole data 
used for training and validation. These examples are then randomly divided into 
training set (70%, used for training of the NN) and validation set (30%, used for 
selecting the NN parameters). Thus, the training set for a single NN will contain a 
smaller number of examples than the original training set and will have the same 
number of features. The best Rs was selected by experimenting with different values 
and evaluating the performance on the validation set (Rs best = 25%). 
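
The example sampling step can be sketched as follows. This is only an illustration under stated assumptions: the function name `sample_examples`, the rounding of the sample size, and the value 365 for the number of first-year examples are ours.

```python
import random

def sample_examples(num_examples, rate, train_fraction=0.7, rng=None):
    """Draw a bootstrap sample of example indices and split it 70/30."""
    rng = rng or random.Random()
    size = round(rate * num_examples)
    # Sampling with replacement: the same example may be drawn more than once.
    sample = [rng.randrange(num_examples) for _ in range(size)]
    rng.shuffle(sample)
    cut = round(train_fraction * size)
    return sample[:cut], sample[cut:]

# One sample for one NN, with Rs = 25% (the rate selected in the text).
train_idx, val_idx = sample_examples(num_examples=365, rate=0.25,
                                     rng=random.Random(0))
```

Repeating the call S times, once per NN, yields S different training and validation sets.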


Random Feature Sampling: The S training sets from the previous step are filtered by 
retaining only some of their features and discarding the rest. This is done by using 
feature sampling with replacement with a pre-defined sampling rate Rf. We split the 
S training sets into three parts and applied Rf1 = 25%, Rf2 = 50% and Rf3 = 75%, one rate for 
each third. 
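
A possible sketch of the feature sampling step is given below. The text does not say how repeated draws are handled, so this sketch assumes that duplicates are collapsed, which means a member can end up with fewer features than the nominal rate suggests. The names `rate_for_member` and `sample_features` are hypothetical.

```python
import random

RATES = (0.25, 0.50, 0.75)  # Rf1, Rf2 and Rf3 from the text

def rate_for_member(member, num_members):
    """Rate for the third of the ensemble that `member` (0-based) belongs to."""
    return RATES[min(3 * member // num_members, 2)]

def sample_features(num_features, rate, rng):
    """Draw feature indices with replacement and keep the distinct ones."""
    draws = [rng.randrange(num_features)
             for _ in range(round(rate * num_features))]
    return sorted(set(draws))  # assumption: repeated draws are collapsed

rng = random.Random(0)
feature_sets = [sample_features(20, rate_for_member(i, 30), rng)
                for i in range(30)]
```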

A single ensemble member is an NN with f input neurons (f < 20), corresponding to 
the sampled features (PV power of the previous day), and 20 outputs, corresponding to all 
20 PV values of the next day. It had one hidden layer, where the number of neurons was set 
to the average of the number of input and output neurons, and was trained using the Levenberg- 
Marquardt version of the backpropagation algorithm. We combined S = 30 NNs. 


3.2 Training Meta-Learners 


Every ensemble member NN_i has an associated meta-learner ML_i, which is trained to 
predict the error of NN_i for the new day. Thus, ML_i takes as an input the PV data for 
day d and predicts the forecasting error of NN_i for day d + 1. This error is then 
converted into a weight for NN_i and used in the weighted average vote combining the 
predictions of all ensemble members. 

The motivation behind using dynamic ensembles is that the different ensemble 
members have different areas of expertise, with their performance changing as the time 
series evolves over time. We can learn to predict the error of an ensemble member for 
the next day based on its prior performance. Then we can use these predicted errors to 
weight the contributions of the ensemble members in the final prediction, so that 
ensemble members that are predicted to be more accurate are given higher weights. In 
this way we match the expertise of the ensemble members with the characteristics of 
the new day and adapt the ensemble to the changes in the time series. 

We implemented and compared two sets of meta-learners: NN and kNN. Both sets 
contain S meta-learners, one for each ensemble member. Each meta-learner was trained 
to predict the MAE of its corresponding ensemble member for the next day. 


NN Meta-Learners. To train an NN meta-learner ML_i for ensemble member NN_i, we 
first need to create its training data, and in particular to obtain the target output. 
Using the trained ensemble member NN_i, we obtain its prediction for all examples from 
the training set; the input is P^d, the PV power vector of the previous day d but containing 
only the f sampled features, and the output is P̂^{d+1}, the predicted PV power vector for the 
next day d + 1 containing all 20 values. We then calculate MAE^{d+1}, the error for day 
d + 1. A training example for ML_i will have the form [P^d, MAE^{d+1}], where P^d is the 
input vector (containing the same f features as NN_i) and MAE^{d+1} is the target output. 
Thus, the NN meta-learner has f input neurons and 1 output neuron. We again used 1 hidden 
layer and the same rule for the number of hidden neurons as for the ensemble members. 
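
The construction of the meta-learner training set can be sketched as follows. This is a simplified illustration: `predict` stands for an already trained ensemble member and is passed in as a plain function, and the names `day_mae` and `meta_training_set` are hypothetical.

```python
def day_mae(actual, predicted):
    """MAE between the actual and predicted values of a single day."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def meta_training_set(days, feature_idx, predict):
    """Build the pairs [P^d restricted to f features, MAE^{d+1}].

    days        -- daily PV vectors in chronological order
    feature_idx -- indices of the f features sampled for this member
    predict     -- trained member: maps the f inputs to a next-day forecast
    """
    examples = []
    for today, tomorrow in zip(days, days[1:]):
        inputs = [today[k] for k in feature_idx]
        target = day_mae(tomorrow, predict(inputs))
        examples.append((inputs, target))
    return examples

# Toy example with 2 values per day and a member that repeats its input.
toy_days = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_examples = meta_training_set(toy_days, [1], lambda x: [x[0], x[0]])
```

Each resulting pair is one training example for ML_i: the inputs match those of NN_i, and the target is the error NN_i actually made on the following day.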


kNN Meta-Learners. In contrast to the NN meta-learners, there is no need to pre-train 
the kNN meta-learners as the computation is delayed until the arrival of the new day. 
Specifically, to build a kNN meta-learner for ensemble member NN_i for the new day 
d + 1, the PV data of the previous day d is collected and processed by selecting the 
same subset of f features as for NN_i. Then, the training set is searched to find the k most 
similar days to day d in terms of the f features. The errors (MAE) of NN_i for the 
days immediately following the neighbors are calculated and averaged to obtain 
MAE^{d+1}, the predicted error of ensemble member NN_i for day d + 1. To select the 
value of k, we experimented with k from 5 to 15, evaluating the performance on the 
validation set; the best k was 10 and it was used in this study. 
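
A kNN meta-learner of this kind can be sketched in a few lines. The similarity measure is not stated for the meta-learners, so the sketch assumes the Euclidean distance; the function name and the toy numbers are ours.

```python
import math

def knn_predicted_error(query, history, k):
    """Average the next-day errors that followed the k most similar days.

    query   -- the f sampled features of day d
    history -- list of (features of day t, MAE of the member for day t + 1)
    """
    ranked = sorted(history, key=lambda item: math.dist(query, item[0]))
    nearest = ranked[:k]
    return sum(err for _, err in nearest) / len(nearest)

# Toy history with one feature per day (hypothetical values).
toy_history = [([0.0], 1.0), ([1.0], 3.0), ([10.0], 100.0)]
toy_error = knn_predicted_error([0.4], toy_history, k=2)
```

No model is fitted in advance: all the work happens when the new day arrives, which is what the text means by delayed computation.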


3.3 Weight Calculation and Combination Methods 


The predicted errors of the corresponding meta-learners for each ensemble member 
need to be converted into weights for the ensemble members. We investigated two 
strategies for calculating the weights: linear and nonlinear. 


Linear. The weight of ensemble member NN_i for predicting day d + 1 is calculated 
as: 

$$w_i^{d+1} = \frac{1 - e_i^{\mathrm{norm}}}{\sum_{j=1}^{S} \left(1 - e_j^{\mathrm{norm}}\right)}$$

where e_i^norm is the predicted error for NN_i for day d + 1 by its corresponding 
meta-learner ML_i, normalised between 0 and 1, and j is over all S ensemble members. 
It is necessary to use 1 − e_i^norm and not e_i^norm, as lower errors should be associated 
with higher weights and vice versa. The denominator ensures that the weights of all 
ensemble members sum to 1. 
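
The linear weighting can be sketched as below. The text does not specify how the errors are normalised between 0 and 1; this sketch assumes min-max normalisation over the S predicted errors and falls back to equal weights when all errors coincide. Both choices are assumptions.

```python
def linear_weights(errors):
    """Weights proportional to 1 - e_norm, summing to 1."""
    lo, hi = min(errors), max(errors)
    if hi == lo:
        # All members are predicted to be equally accurate.
        return [1.0 / len(errors)] * len(errors)
    norm = [(e - lo) / (hi - lo) for e in errors]  # assumed min-max scaling
    total = sum(1.0 - e for e in norm)
    return [(1.0 - e) / total for e in norm]

# Predicted MAE of three members (hypothetical values).
toy_weights = linear_weights([80.0, 90.0, 100.0])
```

Under min-max scaling the member with the highest predicted error receives a weight of exactly 0, so it is effectively dropped from the vote.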


Non-linear. The weight of ensemble member NN_i for predicting day d + 1 is calculated 
as a softmax function of the negative of its predicted error e_i for day d + 1: 

$$w_i^{d+1} = \frac{\exp(-e_i)}{\sum_{j=1}^{S} \exp(-e_j)}$$

where e_i is the predicted error for NN_i by its corresponding meta-learner ML_i, j is over 
all S ensemble members and exp denotes the exponential function. 
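
A minimal sketch of the softmax weighting follows. Subtracting the smallest error before exponentiating is our addition for numerical stability; it does not change the result, because a constant shift cancels between numerator and denominator.

```python
import math

def softmax_weights(errors):
    """Softmax of the negated predicted errors; the weights sum to 1."""
    shift = min(errors)  # cancels out, only prevents underflow
    scores = [math.exp(-(e - shift)) for e in errors]
    total = sum(scores)
    return [s / total for s in scores]

# Two members whose predicted errors differ by ln 3 (hypothetical values).
toy_weights = softmax_weights([0.0, math.log(3.0)])
```

Unlike the linear rule, the softmax depends on the scale of the errors: the larger the differences between predicted errors, the more the weight concentrates on the best member.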


Ensemble Member Combination. The final prediction of EN-meta is calculated by 
the weighted average of the predictions of the individual ensemble members: 

$$\hat{P}^{d+1} = \sum_{j=1}^{S} w_j^{d+1} \hat{P}_j^{d+1}$$

In addition to combining the predictions of all ensemble members, we also con- 
sidered combining only the M best ensemble members, based on their predicted error. 
To select the best M, we experimented with M = 1/3, 1/2 and 2/3 of all ensemble 
members (30 in our study), evaluating the performance on the validation set. 

Hence, there are four strategies for combining the individual predictions — linear vs 
non-linear weight calculation and combining all vs only the best M ensemble members. 
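
The combination step, with or without the selection of the best M members, can be sketched as follows. The sketch uses the softmax rule for the weights and assumes that, when only the best M members are kept, the weights are recomputed over those M members; the text does not state this explicitly. The name `combine` is hypothetical.

```python
import math

def softmax_weights(errors):
    scores = [math.exp(-e) for e in errors]
    total = sum(scores)
    return [s / total for s in scores]

def combine(predictions, errors, best_m=None):
    """Weighted average of member forecasts, optionally over the best_m only."""
    ranked = sorted(range(len(errors)), key=lambda i: errors[i])
    keep = ranked if best_m is None else ranked[:best_m]
    weights = softmax_weights([errors[i] for i in keep])
    horizon = len(predictions[0])
    return [sum(w * predictions[i][t] for w, i in zip(weights, keep))
            for t in range(horizon)]

# Three members, two predicted values each (hypothetical numbers).
toy_predictions = [[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]]
toy_forecast = combine(toy_predictions, errors=[0.0, 0.0, 5.0], best_m=2)
```

In the toy example the third member has the highest predicted error and is excluded, so the forecast is the plain average of the first two.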


4 Methods Used for Comparison 


We compared EN-meta with three groups of methods: (i) single methods: NN, SVR, 
k-NN and a persistence model; (ii) classical ensembles: bagging, boosting and random 
forest; and (iii) static and dynamic versions of EN-meta without meta-learners. 


4.1 Single Models 


NN. An NN with one hidden layer of m nodes, where m was the average of the input 
and output nodes. It takes as an input the 20 half-hourly PV power data of the previous 
day d and predicts the 20 half-hourly PV data for day d + 1. 


SVR. The SVR model is similar to the NN model, except that we train 20 SVRs, each 
predicting one of the 20 half-hourly values for the next day d + 1. All SVRs take as an 
input the 20 half-hourly PV values of the previous day d. 


kNN. To forecast the PV power data of day d + 1, kNN firstly finds the k nearest 
neighbors of day d - the days from the training set with the most similar PV power 
using the Euclidean distance. To compute the predicted PV power output for day d + 1, 
it then finds the days immediately following the neighbors and averages their PV 
power. 


Persistence (P). As a baseline, we developed a persistence model which uses the PV 
power output of day d as the forecast for day d + 1. 


4.2 Classical Ensembles 


We also implemented the regression tree based ensembles Bagging (Bagg), Boosting 
(Boost) and Random Forest (RF). For consistency with the proposed ensemble, the 
number of trees in Bagg, Boost and RF was set to 30. As regression trees cannot predict 
all 20 values for the next day simultaneously, a separate ensemble is created for each 
half-hourly value, as in the SVR model. Thus, we create 20 ensembles of each type. 


4.3 Static and Dynamic Ensembles Without Meta-Learners 


To assess the contribution of the meta-learning component, we also compare EN-meta 
with two versions of this ensemble without meta-learning: static and dynamic. 

The static ensemble is EN-meta without the meta-learning component and using the 
average of the individual predictions to form the final prediction. We refer to this 
ensemble as EN-static. 

The dynamic ensemble is an extension of EN-static; it uses a weighted average for 
combining the individual predictions. The weights of the ensemble members are calculated 
based on their previous performance (error) in the last D days. We used the 
total MAE over the previous 7 days (D = 7). The errors of the ensemble members are 
converted into weights using the same four methods as in the EN-meta ensemble. 
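
The error tracking of the dynamic ensemble can be sketched as follows. The function names and the toy data are ours; the window of 7 days is the value reported in the text.

```python
def recent_errors(daily_mae, window=7):
    """Total MAE of each member over the last `window` days (oldest first)."""
    return [sum(history[-window:]) for history in daily_mae]

def best_members(errors, m):
    """Indices of the m members with the lowest error."""
    return sorted(range(len(errors)), key=lambda i: errors[i])[:m]

# Daily MAE history of three members over ten days (hypothetical values).
toy_history = [[5.0] * 10, [1.0] * 10, [3.0] * 10]
toy_recent = recent_errors(toy_history)
toy_best = best_members(toy_recent, 2)
```

The resulting errors play the same role as the errors predicted by the meta-learners in EN-meta, the difference being that they describe past days rather than the day to be forecast.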

We evaluated the different versions using validation set testing; the best results were 
achieved for the version using a linear transformation and combining the best 
M ensemble members with M = 15; we refer to this ensemble as EN-dynamic. 


5 Results and Discussion 


5.1 Performance of EN-Meta 


Table 1 shows the accuracy results of EN-meta for the two different types of meta- 
learners and four weight calculation methods. The graph in Fig. 3 presents the MAE 
results in sorted order for visual comparison. We also conducted a pair-wise com- 
parison for statistical significance of the differences in accuracy using the Wilcoxon 
rank-sum test with p < 0.05. The results can be summarized as follows: 


• Overall performance: The most accurate version of EN-meta is kNN-bestM-lin, 
  which uses kNN meta-learners, combines the predictions of only the best 
  M ensemble members and uses a linear transformation to convert the predicted errors 
  into weights. It is followed by kNN-bestM-softmax, which differs only in the 
  weight calculation function (softmax instead of linear), and then by NN-bestM-softmax. 

• The pair-wise differences in accuracy between these three best models are not 
  statistically significant, but all other differences between the best model 
  (kNN-bestM-lin) and the other models are statistically significant. 

• All vs best M ensemble members: The EN-meta versions combining only the predictions 
  of the best M ensemble members are more accurate than their corresponding 
  versions which combine the predictions of all ensemble members, and these 
  differences are statistically significant. 

• Linear vs softmax weight calculation: The EN-meta versions using the linear weight 
  calculation outperform their corresponding versions using the softmax weight 
  calculation in 3/4 cases, but the differences are not statistically significant. 

• NN vs kNN meta-learners: The EN-meta versions using kNN meta-learners are 
  more accurate than their corresponding versions using NN meta-learners in all 4 
  cases, but these differences are not statistically significant. 

Based on these results we selected the best version (EN-meta-kNN-bestM-lin) for 
further investigation. We will refer to it as EN-meta. 


Table 1. Accuracy of EN-meta versions 

EN-meta                   MAE [kW]   RMSE [kW] 
with NN meta-learners 
NN-lin                    88.40      115.35 
NN-softmax                89.63      116.13 
NN-bestM-lin              87.75      115.55 
NN-bestM-softmax          87.68      115.29 
with kNN meta-learners 
kNN-lin                   88.10      114.89 
kNN-softmax               89.61      116.11 
kNN-bestM-lin             86.77      114.57 
kNN-bestM-softmax         87.34      115.00 

Fig. 3. MAE comparison (x-axis: meta-learner and weight calculation method) 


5.2 Comparison with Other Methods 


Table 2 compares the accuracy of EN-meta with the single models, classical ensembles 
and the two EN versions without meta-learners (static and dynamic). Figure 4 graphically 
presents the MAE results in sorted order for visual comparison. The main results 
can be summarized as follows: 

• The proposed EN-meta is the most accurate method. It considerably outperforms all 
  other methods and all differences are statistically significant (Wilcoxon rank-sum 
  test, p < 0.05). 

• The next best performing methods are EN-dynamic and EN-static, the EN-meta 
  versions without meta-learners. This shows that the use of meta-learners was 
  beneficial. 

• EN-dynamic is more accurate than EN-static and the difference is statistically 
  significant. This shows the advantage of tracking the error of the ensemble members 
  on recent data and correspondingly weighting their contribution in the weighted 
  vote. 

• By comparing the two dynamic ensembles, EN-meta and EN-dynamic, we can see 
  that the use of meta-learners and the more proactive approach of EN-meta for 
  assessing the ensemble members, based on the predicted error for the new day rather 
  than the error on previous days, gives better results. 

• Bagg is the most accurate classical ensemble, followed by RF and Boost. All 
  classical ensemble models outperform the single models, except for Boost which 
  performed slightly worse than the single NN. 

• Among the single prediction models, NN is the best, followed by SVR, P and kNN. 
  All forecasting models except kNN outperform the baseline P model. 


Table 2. Accuracy of all models 

Method                          MAE [kW]   RMSE [kW] 
EN-meta                         86.77      114.57 
Single models 
NN                              116.64     154.16 
SVR                             121.58     158.63 
kNN                             127.64     166.15 
Persistence                     124.80     184.29 
Classic ensembles 
Bagg                            102.8      1469:40 
Boost                           118.08     158.80 
RF                              110.29     146.25 
EN-meta without meta-learners 
EN-static                       102.50     134.25 
EN-dynamic                      100.46     130.61 

Fig. 4. Comparison of all models (MAE) 


6 Conclusion 


We considered the task of forecasting the PV power output for the next day at half- 
hourly intervals from previous PV power data. We proposed EN-meta - a meta-learning 
ensemble of NNs. The key idea is to pair each ensemble member with a meta-learner 
and train the meta-learner to predict the error for the next day of its corresponding 
ensemble member. The errors are then converted into weights and the final prediction is 
formed using the weighted average of the individual predictions. EN-meta is a dynamic 
ensemble as the combination of predictions is adapted to the characteristics of the new 
day based on the expected error. 

We investigated four strategies for converting the predicted error into weights and 
two types of meta-learners (kNN and NN). We also compared the performance of EN- 
meta with three state-of-the-art ensembles (bagging, boosting and random forest), four 
single models (NN, SVR, kNN and persistence) and two versions of EN-meta without 
meta-learners. The evaluation was conducted using Australian data for two years. Our 
results showed that EN-meta was the most accurate model, considerably and statisti- 
cally significantly outperforming all other methods. The kNN meta-learners were 
slightly more accurate than the NN meta-learners, and the most effective strategy was 
combining only the best M ensemble members and using linear transformation to 
calculate the weights. The use of meta-learners to directly predict the error for the new 
day, instead of estimating it based on the error for the previous days, was beneficial. 

Hence, we conclude that dynamic meta-learning ensembles are promising methods 
for solar power forecasting. 


Abstract. The bag-of-little bootstrap technique provides statistical 
estimates equivalent to those of bootstrap in a tiny fraction of the 
time required by bootstrap. In this work, we propose to integrate 
bag-of-little bootstrap into an ensemble of classifiers composed of random trees. 
We show that using this bootstrapping procedure instead of standard 
bootstrap samples, such as those used in random forest, can dramatically 
reduce the training time of ensembles of classifiers. In addition, the experiments 
carried out illustrate that, for a wide range of training times, the 
proposed ensemble method achieves a generalization error smaller than 
that achieved by random forest. 


1 Introduction 


One of the most successful paradigms of machine intelligence is ensemble learn- 
ing [3,5,7]. Ensembles build a set of diverse predictors by applying different 
randomization and/or optimization techniques. One of the first optimization 
based ensembles is adaboost [8]. In adaboost, the base predictors of the ensemble 
are trained sequentially. To train each single model, adaboost modifies the train- 
ing set in order to increase the importance of the examples incorrectly classified 
by the previous models. This can be seen as an optimization problem solved by 
gradient descent in functional space [13]. On the other hand, diversity in the 
base classifiers could be generated by introducing some randomization into the 
generation process of the base classifiers. The randomization can be applied at 
different levels (e.g. into the training dataset, into the learning algorithm, etc.). 
Randomization is especially effective when unstable base learners are used. For 
instance, random forest uses random trees as the base learners of the ensemble. 
Such trees are unstable by construction as the splits of the tree are computed 
from a reduced random subset of the input attributes [3]. In addition, random 
forest trains each base classifier on a random bootstrap sample, where a bootstrap 
sample consists of n instances drawn at random with replacement from 
the original training data of size n. 

The bootstrap technique was first proposed as a statistical technique to assess 
the quality of estimates [6] and was later applied to the generation of classifiers in 
ensemble learning [2]. An important drawback of this technique, however, is its 
high computational complexity. There are several alternatives to bootstrap that 
are computationally more efficient, such as subsampling (i.e. small samples without 
replacement) or m-out-of-n bootstrap (with m < n). In fact, several theoretical 
and empirical studies have shown that the accuracy of bagging can increase 
significantly when smaller samples are used [9,12,14]. Notwithstanding, in [10] 
it is shown that m-out-of-n and subsampling require statistical corrections when 
used as a technique to assess the quality of estimates. In [10], and in a previous 
study by the same authors [11], bag-of-little bootstrap (BLB) is proposed as 
an alternative to bootstrap that is computationally more efficient and 
has the same statistical properties (consistency and correctness) as bootstrap. 
The study provides a theoretical analysis of the method and several experiments 
in synthetic and real data that show the good statistical properties of BLB. 
However, no real application to classification or regression is performed. 

In this article we analyze the use of bag-of-little bootstrap as a means to 
accelerate the construction of random forest ensembles. The generalization per- 
formance of this modified version of random forest is compared to standard 
random forest. The experiments carried out show that the proposed ensemble 
clearly outperforms standard random forest achieving, for a wide range of allowed 
training time budgets, a lower generalization error. In the current context of large 
datasets, this benefit can be a fundamental advantage to be able to produce a 
classification model in reasonable time. 

The article is organized as follows: Sect. 2 describes the bag-of-little bootstrap 
technique and its combination with random forest; Sect. 3 presents an experimental 
comparison of random forest using standard bootstrap and bag-of-little bootstrap; 
finally, Sect. 4 presents the conclusions of the study. 


2 Proposed Method 


The bag-of-little bootstrap (BLB) method [11] samples the data in two steps. 
First, a small number of instances is sampled without replacement from the original 
dataset. The size of this small sample is set to b = n^γ, with γ ∈ [0.5, 1], where 
n is the size of the dataset. A total of s such small samples are extracted 
from the original dataset. We will call these samples primary samples, D_primary. 
Then, r secondary samples of size n are extracted with replacement from each 
of the primary samples, D_primary. Finally, from each of the secondary samples, 
an estimate of the desired quantity is obtained. It is important to note that the 
secondary samples can contain at most b distinct instances, which is the size of D_primary. 
Hence, instead of actually sampling from D_primary, it is sufficient to weight the 
instances using a vector containing the number of times each instance is sampled. 
This vector of counts can be obtained by drawing n trials from a uniform 
multinomial distribution over b elements. Note that the value of b is expected to 
be much smaller than n (b << n). This is the key implementation feature that 
allows bag-of-little bootstrap to achieve computationally efficient estimates. The 
focus of [11] is on the statistical properties of BLB and not on its use as a tool to create 
ensembles of classifiers. 
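
The weighting trick can be illustrated with a short sketch. The helper names are hypothetical, and rounding n^γ to the nearest integer is an assumption, since the rounding rule is not given in the text.

```python
import random
from collections import Counter

def primary_size(n, gamma):
    """Size b = n^gamma of a primary sample (rounding rule assumed)."""
    return round(n ** gamma)

def uniform_multinomial(n, b, rng):
    """Counts of n uniform draws over b items; the counts sum to n."""
    hits = Counter(rng.randrange(b) for _ in range(n))
    return [hits[i] for i in range(b)]

n = 20_000
b = primary_size(n, 0.5)  # 141 instances instead of 20,000
counts = uniform_multinomial(n, b, random.Random(0))
```

A learner that accepts instance weights then only has to process b weighted instances, although the weights represent a full bootstrap sample of size n.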

In this article, we propose to use bag-of-little bootstrap in combination with 
random trees. The procedure is shown in Algorithm 1. The algorithm has as input 
the training dataset, D_train, composed of n instances, and three parameters: the 
size of the primary samples, b, the number of primary samples, s, and the number 
of secondary samples, r. The secondary samples are weighted with a vector of 
counts drawn from an n-trial uniform multinomial distribution of size b. Finally, 
this weighted dataset is used to train a random tree classifier. 


Algorithm 1. BLB-RF 
Data: 
    D_train = {(x_i, y_i)}, i = 1, ..., n 
    b = n^γ, size of the primary samples 
    s, number of primary iterations 
    r, number of secondary iterations 
Result: {h_t}, t = 1, ..., s·r 
for i ← 1 to s do 
    D_primary = sample_without_replacement(D_train, b) 
    for j ← 1 to r do 
        counts = uniform_multinomial(n, b) 
        h_{(i-1)r+j} = train_random_tree(D_primary, counts) 
    end 
end 
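
Algorithm 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the tree learner is passed in as a function `train_tree`, and the stand-in `weighted_majority` used here is not a random tree but a trivial weighted majority vote that keeps the sketch free of external dependencies.

```python
import random
from collections import Counter

def blb_ensemble(train, gamma, s, r, train_tree, rng):
    """Build s * r base models from s primary samples of size n^gamma."""
    n = len(train)
    b = max(1, round(n ** gamma))  # rounding rule is an assumption
    ensemble = []
    for _ in range(s):
        primary = rng.sample(train, b)  # sampled without replacement
        for _ in range(r):
            # n draws over the b instances, stored as per-instance counts
            hits = Counter(rng.randrange(b) for _ in range(n))
            counts = [hits[k] for k in range(b)]
            ensemble.append(train_tree(primary, counts))
    return ensemble

def weighted_majority(primary, counts):
    """Stand-in learner: returns the count-weighted majority label."""
    votes = Counter()
    for (_, label), c in zip(primary, counts):
        votes[label] += c
    return votes.most_common(1)[0][0]

toy_train = [(i, i % 2) for i in range(100)]  # (feature, label) pairs
toy_models = blb_ensemble(toy_train, gamma=0.5, s=3, r=4,
                          train_tree=weighted_majority, rng=random.Random(0))
```

In practice `train_tree` would wrap a tree learner that supports instance weights, for example a decision tree fitted with the counts as sample weights.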


3 Experiments 


Several experiments have been carried out in order to analyze the validity of the 
technique bag-of-little-bootstraps (BLB) applied to ensembles of classifiers. To 
this end, BLB was implemented as the random sampling mechanism to build 
an ensemble composed of random trees, i.e. the decision tree algorithm used in 
random forest. The efficiency of the proposed ensemble, in the following BLB- 
RF, is compared with standard random forest (RF) under several experimental 
conditions. The base classifier used in both ensembles is the random tree, which 
is a modified CART tree [4] in which no pruning is applied and in which at 
each node a random subset of attributes is selected to find the best split. The 
default parameter value was used for the number of attributes to be selected at 
each node (i.e. sqrt(#attribs)) for both random forest and BLB-RF. The two 
algorithms were trained using two fairly large datasets in order to assess the 
lower computational complexity of BLB-RF with respect to RF. The datasets 
used are: Magic04 [1], that has 19020 instances and ten numeric attributes, and 
Waveform [4], a synthetic dataset with 21 numeric attributes. Waveform was 
used in the experiments since it is possible to generate as many instances as 
needed. Two dataset sizes were considered in the experiments with Waveform: 
20,000 and 1,000,000 instances. 

In a first batch of experiments, the performance of BLB-RF was analyzed for 
mid-sized datasets (n = 20000 instances) for different values of: s, the number 
of subsamples taken from the original training set; γ, which determines the size of 
the samples as n^γ; and r, the number of secondary bootstrap samples extracted 
from the s primary random samples. The ranges of values used in these experiments 
are: s ∈ [1, 20], γ ∈ {0.5, 0.6, 0.7, 0.8, 0.9, 0.95} and r ∈ [1, 40] for Magic04. For 
Waveform with 20000 instances the same values for γ were used but s 
and r were expanded to s ∈ [1, 25] and r ∈ [1, 60]. The size of random forest 
with standard bootstrap sampling is set to 500 trees. Ten times 10-fold 
cross-validation was used as the validation procedure. The reported values are averages 
over the 100 train-test realizations. In addition, for each realization and given 
value of γ, a single execution of BLB-RF is carried out using the maximum value 
of s and r. Once this ensemble is trained, results for intermediate values of s and 
r can be readily obtained by discarding the corresponding decision trees. The 
reason for this experimental design decision is twofold. First, to reduce the total 
computational burden of the experiments and second, and more importantly, to 
reduce the variability of the results that would be obtained with independent 
executions. 
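
The reuse of a single large ensemble can be sketched as follows. The sketch assumes that the trees are stored in the order produced by Algorithm 1 and that an intermediate configuration keeps the first s primary samples and the first r trees of each; the function name `sub_ensemble` is ours.

```python
def sub_ensemble(trees, r_max, s, r):
    """Trees of the first s primary samples, first r secondary samples each.

    trees -- list built with r_max trees per primary sample, so that tree j
             of primary sample i (both 0-based) sits at index i * r_max + j
    """
    return [trees[i * r_max + j] for i in range(s) for j in range(r)]

# Stand-in for the Magic04 grid: 20 x 40 = 800 trees, identified by index.
all_trees = list(range(20 * 40))
small = sub_ensemble(all_trees, r_max=40, s=5, r=2)
```

With s = 5 and r = 2 the selection contains 10 trees, so every point of the (s, r) grid can be evaluated without retraining.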

Figure 1 shows the average results for some representative values of γ for 
Magic04 (left column) and Waveform (right column). Each plot in the figure 
shows, for the given γ value, the average error of the ensembles with respect to 
the average CPU time needed to train each single ensemble, in log scale. Each 
point in the plots represents a complete ensemble for a pair of s and r values. To 
facilitate the interpretation of the plots, executions sharing the same value of s 
but different values of r are linked with solid lines. For instance, the first point of 
the yellow line (s = 5) corresponds to an ensemble with s = 5 primary samples 
each of which is used r = 1 time to generate secondary bootstrap samples. This 
corresponds to an ensemble of 5 trees. The second point on the same line is the 
ensemble trained using s = 5 and r = 2, which has 10 trees, and so on. As another 
example, for Magic04, the last point of all BLB curves corresponds to r = 40, 
which means that the largest ensembles for each curve are of size 1 × 40 = 40 for 
the red line, 200 for the yellow, 400 for the blue and 20 × 40 = 800 for the purple 
line. In the case of Waveform, in which we used an expanded grid up to r = 60, 
the purple curve reaches 25 × 60 = 1500 decision trees. 

From Fig. 1 several interesting aspects of BLB-RF can be identified. First, 
for small values of γ (plots in the first row), BLB-RF is able to output a decision 
in a fraction of the time needed by random forest. The first random forest tree 
is built after almost 1 s for both the Magic04 and Waveform datasets. BLB-RF 
is able to obtain the first tree in less than 0.002 s and in consequence is able to 
produce a first classification over 500 times faster than random forest. In fact, for 
Magic04 the ensemble with s_max and r_max (composed of 20 × 40 = 800 trees) 
is trained in approximately the same time as the first tree of random forest. In 
addition, this ensemble obtains a classification error significantly better than the 
one obtained by the first tree of random forest. In Waveform, the training time 
to build BLB-RF with s = 25 and r = 60 is roughly the same as the time needed 
to build two trees of random forest, with a noticeable difference in generalization 
performance: BLB-RF with s = 25 and r = 60 achieves an average generalization 
error of ≈15%, whereas two random trees of random forest achieve ≈25%. 

Fig. 1. Results for Magic04 (left column) and Waveform (right column) for different 
values of γ 


As the value of γ increases, the curves corresponding to BLB-RF tend to be 
closer to the curve of random forest. As can be observed from the plots for 
Magic04 and Waveform, in general the performance of BLB-RF is better than 
that of random forest, except in Magic04 for γ = 0.95 in some configurations of s 
and r. For these last cases, the generalization error of BLB-RF is slightly worse 
than that of random forest for the same computational time. 

In order to better visualize this aspect, we have computed the best and worst 
performance in terms of the generalization error with respect to the computational 
time of the executions. That is, for a given computational budget, t, the 
best and worst results are extracted from all the combinations of s, γ and r 
that could be trained in less than or equal to t seconds. The result is plotted in 
Fig. 2. From this plot it is clear that for both analyzed datasets, the performance 
of BLB-RF is generally better than that of random forest for all possible time budgets. 
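
The best and worst curves described above can be computed with a few lines of code. This is a minimal sketch; the name `envelope` and the toy measurements are ours.

```python
def envelope(runs, budget):
    """Best and worst error among configurations trained within the budget.

    runs -- list of (training time in seconds, generalization error) pairs,
            one per combination of s, gamma and r
    """
    feasible = [err for secs, err in runs if secs <= budget]
    if not feasible:
        return None  # no configuration fits in the budget
    return min(feasible), max(feasible)

# Three hypothetical configurations.
toy_runs = [(0.01, 0.30), (0.5, 0.20), (2.0, 0.15)]
within_one_second = envelope(toy_runs, budget=1.0)
```

Evaluating `envelope` on a grid of budgets t gives the two curves that bound all BLB-RF configurations.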


Fig. 2. Best and worst generalization error for all configurations of BLB-RF with 
respect to the computational time budget for Magic04 (left) and Waveform of 20000 
instances (right) 


To validate the performance of BLB-RF on a larger dataset, we conducted 
a second experiment on the Waveform dataset, generating 10^6 instances. 
For this experiment a single 10-fold cross-validation was used to validate the performance 
of the algorithms. Hence, the training times are based on training datasets 
composed of 900000 instances. Due to computational limitations, the size of random 
forest is reduced to 50 random trees. Similarly, the range of parameter values for 
BLB-RF is reduced to s ∈ [1, 10] and r ∈ [1, 20]. The values for γ are kept the 
same, that is {0.5, 0.6, 0.7, 0.9, 0.95}. The results for this experiment are shown 
in Fig. 3. Similarly to Fig. 2, this figure shows the average generalization error 
of the best and worst configurations of BLB-RF with respect to the training 
computational budget, t. Random forest is also plotted. 

From Fig. 3, we can observe that the performance of BLB-RF is significantly 
better than that of random forest for all possible time budgets. The differences 
between both methods have clearly increased with respect to the use of 
the smaller Waveform set (see right plot in Fig. 2). BLB-RF produces the first 
classification result in less than 0.05 s while random forest needs ≈500 s, which 
is over 10000 times slower. In fact, in this setting, BLB-RF is able to achieve 
a generalization error lower than that of random forest before the first tree 
of random forest is trained. BLB-RF achieves the final error of random forest 
(14.3%) after 20 s, whereas random forest needs over 24,000 s. 


544 P. de Vina and G. Martinez-Munoz 


Fig. 3. Best and worst generalization error for all configurations of BLB-RF for a given 
computational time budget for Waveform with 10^6 total instances 


4 Conclusions 


In this article we propose the use of the technique of bag-of-little bootstraps 
together with an ensemble of random trees. This technique produces statistical 
estimates equivalent to bootstrap using a fraction of the time. The technique 
proceeds in two steps. First, small random samples from the data are extracted 
without replacement. From each of these small samples, r bootstrap samples 
with replacement are generated with the size of the original dataset. For this 
second sampling, the instances are weighted using a vector of counts drawn from 
a uniform multinomial distribution. Finally, a random tree is trained on each of 
the weighted samples to compose the ensemble. 
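
The two sampling steps summarized above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `blb_weight_vectors`, its arguments and the subsample size b = n^γ are assumptions made for the example, and the tree-training step is left out.

```python
import random
from collections import Counter

def blb_weight_vectors(n, s, r, gamma, seed=0):
    """Return s * r (indices, counts) pairs, one per tree to be trained.

    n     : size of the original dataset
    s     : number of small subsamples drawn without replacement
    r     : number of bootstrap replicates per subsample
    gamma : exponent giving the subsample size b = n ** gamma (assumed here)
    """
    rng = random.Random(seed)
    b = max(1, int(n ** gamma))
    samples = []
    for _ in range(s):
        # Step 1: a small subsample of distinct instances (no replacement).
        subsample = rng.sample(range(n), b)
        for _ in range(r):
            # Step 2: a bootstrap of size n over the b instances, stored as
            # per-instance counts (a draw from a uniform multinomial).
            draws = Counter(rng.choices(range(b), k=n))
            counts = [draws.get(j, 0) for j in range(b)]
            samples.append((subsample, counts))
    return samples

samples = blb_weight_vectors(n=10_000, s=3, r=4, gamma=0.5)
```

Each `counts` vector would then be passed as instance weights to a tree learner, so that a tree only ever touches the b distinct instances of its subsample.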

We have shown that the proposed ensemble is computationally much more
effective than random forest. On the one hand, we have shown that for relatively
large datasets, the proposed method is able to train an ensemble in a time that
is orders of magnitude smaller than the time required to build the first tree of
random forest. On the other hand, for a large range of given time budgets, the
proposed ensemble is able to achieve a generalization error lower than that of
random forest.

Abstract. Although the majority of machine learning approaches
aim to solve binary classification problems, several real-world applications
require specialized algorithms able to handle many different classes,
as in the case of single-label multi-class and multi-label classification
problems. The Label Ranking framework is a generalization of the
above-mentioned settings, which aims to map instances from the input space
to a total order over the set of possible labels. However, these
algorithms are generally more complex than binary ones, and their application to
large-scale datasets can be intractable.

The main contribution of this work is the proposal of a novel general 
on-line preference-based label ranking framework. The proposed frame- 
work is able to solve binary, multi-class, multi-label and ranking prob- 
lems. A comparison with other baselines has been performed, showing 
effectiveness and efficiency in a real-world large-scale multi-label task. 


Keywords: Preference Learning Machine · Multi-class · Multi-label ·
Big data · Large-scale


1 Introduction 


Nowadays, the majority of Machine Learning techniques are able to solve binary
classification problems, where the algorithms try to determine whether a pattern
belongs to a positive (+1) or a negative (-1) class. Although the
binary classification setting is the best known, most studied and most used, there
are several problems and real-world applications in which this approach is not suitable,
as is the case of multi-class and multi-label models.

In the literature several mechanisms exist to extend the binary classification
setting. The simplest approach is based on decomposition methods, such as
the one-against-one and one-against-all [8] approaches. Basically, these methods
decompose the original multi-class problem into several binary tasks. Then, these
binary problems are solved using binary classifiers and predictions are combined
with a voting procedure. More complex approaches try to model a single
multi-class/multi-label problem, as in the case of the Label Ranking framework based
on preferences [15], which aims to learn a total order on the set of possible labels.
However, these methods usually suffer from scalability issues with respect to the
number of classes, making the original problem intractable when this number
is large. Besides, due to the constant growth of the available data, a challenging
goal of these algorithms is to solve these problems efficiently in terms of
computational cost and required resources.

© Springer Nature Switzerland AG 2018
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11139, pp. 546-555, 2018.
https://doi.org/10.1007/978-3-030-01418-6_54

Learning Preferences for Large Scale Multi-label Problems 547

Inspired by these motivations, this paper presents an extension of the Preference
Learning Machine (PLM) [2], a general label ranking framework to learn
preferences in binary, multi-class and multi-label settings. The proposed extension
mainly includes an efficient and scalable learning procedure, based on the
Voted Perceptron algorithm [7], and online learning capability.

The proposed approach has been compared with Neural Networks on a 
real-world multi-label application. The multi-label task consists of a large-scale 
semantic indexing of PubMed documents, based on the Medical Subject Head- 
ings (MeSH) thesaurus. 


2 Notation and Background 


In the (single-label) multi-class classification problem, the unique label associated
with each pattern $x$ from the input space $\mathcal{X} \subseteq \mathbb{R}^d$ is selected from a predefined
set of labels $\Omega = \{\omega_1, \dots, \omega_m\}$, where $m = |\Omega|$ is the number of possible labels.
A common example of a multi-class problem is digit recognition,
where the goal is to find the true digit corresponding to a handwritten input [9].

Let us now consider the problem of associating keywords from a given set
$\Omega$ to a textual document [3]. Unlike in the previous case, the number of
associated labels (keywords) can be more than 1, and each document might have
a different number of keywords. Hence, the task is to learn a mapping from a
document to a set of labels. These kinds of problems are referred to as multi-label
classification problems.

It is easy to see that the single-label multi-class problem is a generalization of
the binary setting, where $m = 2$ and, in turn, the multi-label problem is a generalization
of the single-label multi-class problem.

In all of these settings, the label set $y \in \mathcal{Y} \subseteq \{+1,-1\}^m$ associated with each
pattern $x \in \mathcal{X}$ can be coded as a binary $m$-dimensional vector, where each
element $y_i$ is active (+1) if and only if the label $\omega_i$ is assigned to the pattern $x$.

Based on this code, training examples can be stored in two matrices. Let
$X \in \mathbb{R}^{l \times d}$ be the training matrix, where $d$-dimensional vectors are arranged in $l$
rows, and let $Y \in \{+1,-1\}^{l \times m}$ be the corresponding label matrix, where rows
contain the codes of the training patterns. The notation $x_i$ is also used to identify
the $i$-th pattern.
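
The coding of label sets into the matrix Y can be illustrated with a short sketch. The helper `encode_labels` and the toy label names are hypothetical and serve only to make the {+1, -1} convention concrete.

```python
def encode_labels(label_sets, labels):
    """Build the {+1, -1} label matrix Y with one row per pattern."""
    index = {label: j for j, label in enumerate(labels)}
    Y = [[-1] * len(labels) for _ in label_sets]
    for i, assigned in enumerate(label_sets):
        for label in assigned:
            Y[i][index[label]] = +1
    return Y

labels = ["w1", "w2", "w3"]                    # hypothetical label set, m = 3
label_sets = [{"w1"}, {"w2", "w3"}, set()]     # one set of labels per pattern
Y = encode_labels(label_sets, labels)
# Y == [[1, -1, -1], [-1, 1, 1], [-1, -1, -1]]
```

A single-label multi-class problem is the special case in which every row of Y contains exactly one +1.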

Besides the concept of multi-label classification, the more general multi-label 
ranking has been introduced [4]. The multi-label ranking approach aims to pre- 
dict the ranking of all labels instead of predicting only the set of relevant ones. 


2.1 Related Work


Motivated by the increasing number of new applications, such as automatic
annotation of video, images and textual documents, the problem of learning
from multi-label data is attracting a large part of modern research. Recently,
several different approaches have been developed to solve multi-label
problems [6,12,14]. It is possible to divide these methods into two categories
[13]: adaptation methods and problem transformation methods.

Adaptation methods extend specific machine learning algorithms to handle
multi-label data, as in the case of Neural Networks which use an extended
back-propagation algorithm with dedicated error functions (see [11] for a detailed
explanation).

Problem transformation methods, instead, are those algorithms which map
the multi-label classification problem into one or more binary tasks. The best
known problem transformation approach is the one-against-all decomposition
method [8]. This method generates an ensemble of $m = |\Omega|$ binary classifiers.
The $i$-th classifier is trained with all the examples of the $i$-th class labeled as
positive, and all the other examples labeled as negative. Once the models are trained,
there are $m$ decision functions. In a ranking multi-label setting, these decision
functions define the score for each label. Furthermore, in a single-label multi-class
problem the predicted label is the one which achieves the highest score.

See [1,15,16] for detailed surveys of multi-label problems. 


3 Working with Preferences 


Several algorithms able to solve Label Ranking problems exist in the literature.
Some of them are based on the concept of preferences, which define an ordering
relation on labels and examples. Methods based on preferences try to find a
ranking hypothesis $f_\Theta : \mathcal{X} \times \Omega \to \mathbb{R}$ with parameters $\Theta$, which assigns to each
label $\omega_i \in \Omega$ a score $f_\Theta(x, \omega_i)$ for a fixed pattern $x \in \mathcal{X}$.

These algorithms can be restricted to two particular cases: learning instance 
preference and learning label preference [5]. 

In the label preference scenario, a preference relation is defined as a
bipartite graph $g = (N, A)$, where $N \subseteq \mathcal{X} \times \Omega$ is the set of nodes and $A \subseteq N \times N$
is the set of arcs.

A node $n = (x_i, \omega_j) \in N$ is a pair composed of an example and a label, and
it is a positive node iff the label $\omega_j$ is positive for the example $x_i$; otherwise $n$
is a negative node.

An arc $a = (n_s, n_e) \in A$ connects a starting (positive) node $n_s = (x_i, \omega_j)$ to
its ending (negative) node $n_e = (x_k, \omega_q)$. The direction of the arc indicates that
the starting node must be preferred over the ending node.

The margin of an arc $a = (n_s, n_e)$ is the difference between the values
of the ranking function $f_\Theta$ on the starting and ending nodes,

$$\rho_A(a, \Theta) = f_\Theta(n_s) - f_\Theta(n_e) = f_\Theta(x_i, \omega_j) - f_\Theta(x_k, \omega_q).$$

An arc $a = (n_s, n_e)$ is consistent with the hypothesis $f_\Theta$ iff the score assigned to
the node $n_s$ is greater than the score assigned to the node $n_e$, $f_\Theta(n_s) > f_\Theta(n_e)$,
and thus the margin $\rho_A(a, \Theta) > 0$. The margin of a graph $g = (N, A)$ is the minimum
margin of its arcs, $\rho_A(g, \Theta) = \min_{a \in A} \rho_A(a, \Theta)$. Then, a graph is consistent with
the hypothesis $f_\Theta$ iff its arcs are consistent, $\rho_A(g, \Theta) > 0$.
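
The definitions of arc margin, graph margin and consistency translate directly into code. The sketch below uses a hypothetical scoring function stored as a lookup table; the example and label names are placeholders.

```python
def arc_margin(f, arc):
    """Margin of an arc: score of the starting node minus score of the ending node."""
    (x_s, w_s), (x_e, w_e) = arc
    return f(x_s, w_s) - f(x_e, w_e)

def graph_margin(f, arcs):
    """Margin of a graph: the minimum margin over its arcs."""
    return min(arc_margin(f, a) for a in arcs)

def is_consistent(f, arcs):
    """A graph is consistent with f iff its margin is strictly positive."""
    return graph_margin(f, arcs) > 0

# Hypothetical scoring function: a table of scores per (example, label) pair.
scores = {("x1", "a"): 2.0, ("x1", "b"): 0.5, ("x2", "a"): -1.0, ("x2", "b"): 1.5}
f = lambda x, w: scores[(x, w)]

arcs = [(("x1", "a"), ("x2", "a")),   # margin 2.0 - (-1.0) = 3.0
        (("x2", "b"), ("x1", "b"))]   # margin 1.5 - 0.5 = 1.0
```

Here the graph margin is 1.0, the smaller of the two arc margins, so the graph is consistent with this scoring function.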

In the instance preference task instead, preferences are defined by considering
a single example at a time. In this scenario, an arc $a \in A$ connects nodes with
the same example, $a = (n_s, n_e)$, with $n_s = (x_i, \omega_j)$ and $n_e = (x_i, \omega_q)$.

It is easy to see that the label preference scenario tries to separate simultane- 
ously the whole set of examples with their positive nodes and the set of negative 
nodes. Thus, it is suitable for solving classification tasks. In the instance prefer- 
ence approach instead the algorithms try to optimize the inner ordering for each 
example. 

Some examples of instance preference graphs for a 2-label classification prob- 
lem are shown in Fig. 1, where for each example: (a) there is only one fully con- 
nected graph which connects all positive labels to all negative ones; (b) for each 
example there are two graphs which connect each positive label to all of the 
negatives; (c) there is a graph for each pair of labels, the first positive and the 
second negative. The architecture of these graphs is a hyperparameter selected a 
priori. Note that for each graph structure, the number of total arcs is the same. 


[Figure 1: preference graphs (a), (b) and (c), with arcs from positive label nodes to negative label nodes.]


Fig. 1. Examples of preferences for 2-label classification. $p_i$ are the positive labels and
$n_j$ the negative ones.


The last ingredient of a preference algorithm is a loss function $L$ which penalizes
the non-consistent preferences. A label ranking algorithm based on preferences
tries to find the hypothesis $\hat{f}$ from the hypothesis space $F$ which minimizes
$L$. Loss functions considered in this work are based on the margin of graphs:

$$\hat{f} = \arg\min_{f_\Theta \in F} \sum_{g \in V} L(\rho_A(g, \Theta))$$

where $V$ is the set of preference graphs.


3.1 Preference Learning Machine 


The Preference Learning Machine (PLM) [2] belongs to the label preference
setting. It is a general kernelized framework for solving multi-class and label
ranking problems, by learning a function that maps each example to a total order
on the set of possible labels.

The PLM framework consists of a multivariate embedding $h : \mathcal{X} \to \mathbb{R}^s$
parametrized by a set of $s$ vectors $W_k \in \mathbb{R}^d$, $k \in \{1, \dots, s\}$, arranged in the
matrix $W \in \mathbb{R}^{s \times d}$. Thus, $h(x) = [h_1(x), \dots, h_s(x)] = [\langle W_1, x \rangle, \dots, \langle W_s, x \rangle]$.
Furthermore, let $M \in \mathbb{R}^{m \times s}$ be the matrix containing the $s$-dimensional code
for each label $\omega_r \in \Omega$.

The scoring function for a given example $x$ and a given label $\omega_r$ can be
computed as the dot product between the embedding and the code vector of $\omega_r$,
that is

$$f(x, \omega_r) = \langle h(x), M_r \rangle = \sum_{k=1}^{s} M_{rk} \langle W_k, x \rangle.$$

The original PLM [2] considers a fixed $m$-dimensional orthogonal coding $M$,
defined as the $m \times m$ identity matrix. The authors also formulated the problem of
learning the embedding $W$ as a kernelized optimization problem.
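
As a rough illustration of the scoring function, the sketch below computes the scores of all labels for one pattern with NumPy. The function name `plm_scores` and the random toy values are assumptions; the identity coding corresponds to the original PLM setting described above.

```python
import numpy as np

def plm_scores(W, M, x):
    """Scores f(x, w_r) = <h(x), M_r> for every label r.

    W : (s, d) embedding matrix with rows W_k
    M : (m, s) coding matrix with rows M_r
    x : (d,) input pattern
    """
    h = W @ x          # embedding h(x) = [<W_1, x>, ..., <W_s, x>]
    return M @ h       # one score per label

rng = np.random.default_rng(0)
d, m = 5, 3
x = rng.normal(size=d)
W = rng.normal(size=(m, d))     # s = m when the coding is the identity
M = np.eye(m)                   # fixed orthogonal coding of the original PLM
scores = plm_scores(W, M, x)
ranking = np.argsort(-scores)   # labels ordered from highest to lowest score
```

With the identity coding the scores reduce to `W @ x`, that is, one linear scorer per label; a learned coding with s smaller than m shares the s embedding directions among all labels.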


4 The Proposed Extension 


In the proposed setting, preferences consist of graphs with two nodes connected
by a single arc. The first node is represented by an example with one of its
positive labels, whereas the second node is a (potentially different) example
with one of its negative labels.

The main extension concerns the possibility of learning the Coding matrix
$M$, making the algorithm more expressive than the original one. Two
versions of the algorithm are proposed in this work: the EC-PLM
(Embedding-Coding PLM) and the EP-PLM (Embedding-PCA PLM).

The EC-PLM uses a pair of Voted Perceptron [7] algorithms to efficiently learn
both the Embedding $W$ and the Coding $M$. Broadly speaking, the EC-PLM
performs an alternating optimization procedure to learn its parameters. During
each epoch, the algorithm fixes the Coding and optimizes the Embedding by
means of a Voted Perceptron. Then, it fixes the Embedding and optimizes the
Coding by using the same procedure. After each optimization, the Embedding
$W$ and the Coding $M$ are rescaled by their Frobenius norms, $W \leftarrow W / \|W\|_F$ and
$M \leftarrow M / \|M\|_F$.

The training set used to learn the Embedding is composed of preferences.
Let $a$ be the arc of a preference graph which connects the starting node $(x_i, \omega_j)$
with the ending node $(x_k, \omega_q)$. The preference uses the same representation as the
PLM, which consists of an $s \times d$ dimensional vector $z = (M_{\omega_j} \otimes x_i) - (M_{\omega_q} \otimes x_k)$,
where $\otimes$ denotes the Kronecker product between vectors and $M_{\omega_j}$, $M_{\omega_q}$ are the codes
of $\omega_j$ and $\omega_q$. The dimensionality $s$ of codes is a hyperparameter.
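
The reason this representation is convenient is that the margin of a preference becomes a single dot product between $z$ and the flattened embedding, which is what a perceptron needs. The following sketch checks this numerically; all values are random toy data and the variable names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
s, d, m = 4, 6, 3
W = rng.normal(size=(s, d))          # embedding
M = rng.normal(size=(m, s))          # coding, one row per label
x_i, x_k = rng.normal(size=d), rng.normal(size=d)
j, q = 0, 2                          # label of the starting and ending node

# Preference vector z = (M_j kron x_i) - (M_q kron x_k), of length s * d.
z = np.kron(M[j], x_i) - np.kron(M[q], x_k)

# Margin computed from the scoring function f(x, w_r) = <W x, M_r>.
margin = M[j] @ (W @ x_i) - M[q] @ (W @ x_k)

# The same margin as one dot product with the row-major flattened embedding.
margin_linear = W.ravel() @ z
```

Because the margin is linear in `W.ravel()`, a violated preference can be corrected with the usual perceptron step of adding `z` to the flattened embedding.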

When the second perceptron learns the coding matrix, preferences are defined
as $z = (y_s \otimes \langle W, x_i \rangle) - (y_e \otimes \langle W, x_k \rangle)$, where $y_s$ is an $m$-dimensional zero vector
with a 1 at the $j$-th element (and $y_e$ likewise at the $q$-th element). However, the
algorithm requires an initialized code matrix at the first epoch, to learn the first
embedding. The initial coding $M$ contains random values.

Furthermore, a faster version of the PLM has been considered, dubbed EP-PLM,
in which the coding $M$ is computed by means of a Principal Component
Analysis (PCA) procedure. Thus, the algorithm requires a single Voted Perceptron
procedure.

Let $K_L$ be the linear kernel between labels, $K_L = Y^\top Y$, which counts the
number of common examples for each pair of labels. The kernel matrix is then
decomposed as $U \Lambda U^\top$, where $U$ is the matrix containing the eigenvectors, and
$\Lambda$ the diagonal matrix containing the eigenvalues. The Coding $M$ is defined as
$U_s \Lambda_s$, where $U_s$ is the matrix containing the $s$ eigenvectors associated with the top
$s$ eigenvalues. Note that the complexity of this approach mainly depends on the
number of labels, and it can be applied to very large scale datasets.
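
A minimal NumPy sketch of this PCA-based coding is given below. It assumes the label kernel is the m × m matrix $Y^\top Y$ and uses a dense symmetric eigendecomposition; the function name `pca_coding` and the random toy label matrix are illustrative only.

```python
import numpy as np

def pca_coding(Y, s):
    """Return an (m, s) coding matrix built from the label kernel Y^T Y."""
    K = Y.T @ Y                               # (m, m) kernel between labels
    eigvals, eigvecs = np.linalg.eigh(K)      # eigh returns ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:s]       # indices of the s largest eigenvalues
    U_s = eigvecs[:, top]
    Lambda_s = np.diag(eigvals[top])
    return U_s @ Lambda_s

rng = np.random.default_rng(2)
Y = rng.choice([-1.0, 1.0], size=(50, 8))     # toy label matrix: l = 50, m = 8
M = pca_coding(Y, s=3)
```

The kernel has size m × m regardless of the number of examples, which is why the cost of the decomposition is driven by the number of labels.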

The pseudo-code of the EC-PLM algorithm is shown in Algorithm 1.


Algorithm 1. The Embedding-Coding Preference Learning Machine
Input:
    s: the dimensionality of codes
    t: the number of epochs
    X: the training matrix
    Y: the label matrix
Output:
    W: the embedding function
    M: the coding function
W^(0) ← 0
M^(0) ← random m × s code matrix
for i ∈ 1...t do
    W^(i) ← Voted_Perceptron(M^(i-1))
    M^(i) ← Voted_Perceptron(W^(i))
end
return W^(t), M^(t)
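
The alternating scheme of Algorithm 1 can be sketched in a few lines. Note the simplification: the sketch uses a plain perceptron update on each violated preference instead of the Voted Perceptron [7], and the preference format `(i, j, k, q)`, the function names and the toy data are all assumptions made for illustration.

```python
import numpy as np

def perceptron_epoch(W, M, X, prefs, update):
    """One pass over the preferences, updating W or M on each violated one."""
    for i, j, k, q in prefs:              # (x_i, w_j) preferred over (x_k, w_q)
        h_i, h_k = W @ X[i], W @ X[k]
        margin = M[j] @ h_i - M[q] @ h_k
        if margin <= 0:
            if update == "embedding":     # coding fixed, move W
                W += np.outer(M[j], X[i]) - np.outer(M[q], X[k])
            else:                         # embedding fixed, move M
                M[j] += h_i
                M[q] -= h_k
    return W, M

def rescale(A):
    norm = np.linalg.norm(A)              # Frobenius norm for a matrix
    return A / norm if norm > 0 else A

def ec_plm(X, prefs, m, s, t, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((s, X.shape[1]))         # W^(0) <- 0
    M = rng.normal(size=(m, s))           # M^(0) <- random m x s code matrix
    for _ in range(t):
        W, M = perceptron_epoch(W, M, X, prefs, "embedding")
        W = rescale(W)
        W, M = perceptron_epoch(W, M, X, prefs, "coding")
        M = rescale(M)
    return W, M

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))              # toy training matrix, rows are patterns
prefs = [(int(rng.integers(20)), 0, int(rng.integers(20)), 1) for _ in range(30)]
W, M = ec_plm(X, prefs, m=2, s=3, t=5)
```

Since each update touches one preference at a time, the same loop could consume preferences from a stream rather than from a fixed list.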


Due to the characteristics of the Voted Perceptron algorithm and its capa- 
bility to work with one preference at a time, the EC-PLM can be easily used to 
work with on-line streams of examples and preferences. 

On the other hand, the EP-PLM is able to learn the coding with millions of 
examples efficiently. Furthermore, on each epoch it uses a single Voted Percep- 
tron to learn the Embedding. The complete procedure is very fast, especially if 
the input examples use a sparse representation. 


5 Experimental Assessment 


In order to empirically evaluate the proposed method, it has been tested on
a complex multi-label task, which consists of a large-scale online biomedical
semantic indexing of PubMed documents based on the Medical Subject Headings
(MeSH) [10]. The MeSH thesaurus is a controlled vocabulary produced
by the National Library of Medicine (NLM), used for indexing and cataloging
the biomedical literature in MEDLINE, which is the NLM bibliographic database
containing 24 million journal articles.

The MeSH vocabulary consists of a hierarchy of tags. This work focused on 
the bottom layer of this hierarchy, which includes 28333 descriptors or heading 
tags, that represent main topics or concepts in the biomedical literature. 

In this setting, heading tags represent the set of all possible classes or labels, 
and the task is to find for each example a total order in this set. 


5.1 Baselines 


The proposed methods have been compared against a Multiple Layer Perceptron
(MLP) which has the same architecture as the one used in the PLM. Let us consider
a fully connected MLP with a $d$-dimensional input layer, which maps the input
into a hidden $s$-dimensional layer by means of a dense $d \times s$ linear connection.
Then, the hidden layer maps information onto an $m = |\Omega|$ dimensional output layer
by using a dense $s \times m$ linear connection. From this perspective, it is easy to show
that the two mappings between layers correspond to the Embedding $W$ and
Coding $M$ used in the PLM setting.

However, although the PLM can be mapped into an MLP and vice versa, the
learning mechanisms used are quite different. The MLP uses a back-propagation
procedure whereas the PLM tries to optimize each input preference.

Other baselines were initially considered: the Support Vector
Machine (SVM) with the one-against-all multi-class strategy, and the original PLM.
However, due to the dimensionality of the considered problem and the complexity
of these methods, only the MLP has been used.


5.2 Empirical Evaluation 


A wide experimental setting has been used to compare the two versions of the 
algorithm, in terms of AUC score, computational cost and required resources. 

First, 20000 abstracts have been randomly selected from the PubMed
repository with their respective MeSH tags. Abstracts have been tokenized by
splitting on spaces and punctuation, and stop-words have been removed. The
stop-list is the one defined by the scikit-learn library. The global dictionary has
been computed by considering only unigrams.

Then, the resulting dictionary has been reduced by keeping only the
100000 most frequent terms. Finally, the Bag-Of-Words (BOW) feature vector
has been computed for each input document. A test set, which also includes 20000
abstracts, has been preprocessed using the same pipeline. To compute the
coding matrix $M$ used in the EP-PLM version, a PCA over 10 million PubMed
documents has been used. The dimension $s$ of codes has been fixed to 50. In each
epoch, the Voted Perceptron procedure optimizes 2000 randomly selected
preferences. Finally, the training subsample covers 17071 different MeSH tags.

In order to properly understand the behavior and the empirical convergence
of the two algorithms, a preliminary analysis has been performed, showing the
micro and macro AUC measures while increasing the number of training epochs.
Results are shown in Fig. 2.

The figure shows that the EP-PLM empirically outperforms
the EC-PLM, even though it uses a fixed code matrix instead of learning it
dynamically from data. Probably, this improvement is due to the fact that the EP-PLM
uses 10 million examples to learn the coding instead of 20000, as is the case for
EC-PLM. In terms of computational cost, the EC-PLM requires on average 132 s
to complete a single epoch, whereas the EP-PLM requires 95 min to compute
the PCA, and 19 s per epoch. The experiments were carried out on an Intel(R)
Xeon(R) CPU E5-2650 v3 @ 2.30 GHz.

Fig. 2. Empirical convergence of the proposed algorithm.


Subsequently, the combination of several EC/EP-PLM models has been analyzed
using a bagging procedure, aiming to facilitate the application of these
algorithms to large-scale problems. Ten different datasets have been extracted
from the PubMed repository following the procedure described at the beginning
of this section, each with 20000 training examples. Figure 3 shows the empirical
effectiveness of the algorithm while increasing the number of models in the case
of EC-PLM and EP-PLM. Not surprisingly, the bagging procedure has a strong
impact on the EC-PLM setting, in which each model uses only 20000 examples
for both the Embedding and the Coding. The AUC scores of the EP-PLM also
increase as the number of models increases.

Finally, a comparison against the Multiple Layer Perceptron has been performed.
Figure 4 shows Micro and Macro AUC scores of the EC-PLM and
EP-PLM against the MLP with linear and sigmoid activation functions, while
increasing the dimension $s \in \{25, 50, 100, 200\}$ of the codes and the hidden layer.
This experiment shows that the proposed methodologies outperform an MLP with
the same inner structure as the PLM. Moreover, the value of $s$ significantly affects
the EC/EP-PLM and the linear MLP in particular.

Fig. 3. Micro and Macro AUC scores while increasing the number of combined models.

Fig. 4. Micro and Macro AUC scores of EC/EP-PLM and the MLP while increasing
the dimension $s$ of the middle space.


6 Conclusion 


We have proposed a general framework for on-line preference-based label ranking 
that can be applied to binary, multi-class, multi-label and ranking problems. 
Two different versions of the algorithm have been discussed and analyzed. The 
first focuses on the efficiency whereas the latter is an effective on-line learner. A 
comparison with some baselines has shown its effectiveness and efficiency in a 
real-world large-scale multi-label task.

Abstract. Recent state-of-the-art deep metric learning approaches
require a large number of labeled examples for their success. They cannot
directly exploit unlabeled data. When labeled data is scarce, it is
essential to be able to make use of additionally available unlabeled
data to learn a distance metric in a semi-supervised manner. A few
traditional, non-deep semi-supervised metric learning approaches exist, but
they mostly rely on the min-max principle to encode the
pairwise constraints, even though a number of other ways are offered
by traditional weakly-supervised metric learning approaches. Moreover,
there is no flow of information from the available pairwise constraints to
the unlabeled data, which could be beneficial. This paper proposes to
learn a new metric by constraining it to be close to a prior metric while
propagating the affinities among pairwise constraints to the unlabeled
data via a closed-form solution. The choice of a different prior metric
thus enables encoding of the pairwise constraints by following formulations
other than the min-max principle.


Keywords: Mahalanobis distance · Affinity propagation ·
Metric learning · Image retrieval · Person re-identification ·
Graph-based learning · Semi-supervised learning · Classification ·
Fine-grained visual categorization


1 Introduction 


Distance Metric Learning (DML) aims at learning the distance between a pair of
examples with the objective of bringing similar examples together while pushing
away dissimilar examples. Deep neural networks have demonstrated remarkable
success in machine learning tasks such as classification, clustering, verification
and retrieval. DML is a pivotal step in such tasks. As such, deep DML has
gained much popularity lately. Popular deep DML approaches [8,18,20,22,25]
aim at learning distance metrics in an end-to-end fashion with a pretrained
network, such as GoogLeNet [23]. Their success depends on a number of factors:
(i) availability of a large number of labeled examples, (ii) formulation of an
appropriate loss function (which mainly involves the last layer) and (iii) mining
informative constraints. A few mining strategies are discussed in [20-22]. However, the
ability to learn from unlabeled data has not been exploited by the deep DML
approaches. Recently, [15] employed a random walk process to mine constraints
for deep DML by considering the manifold similarity in an unsupervised manner.
But the random walk process therein cannot exploit already available pairwise
similarity/dissimilarity constraints and hence cannot be directly extended to a
Semi-Supervised DML (SS-DML) setting. Another important observation is that
apart from a few approaches like [18,25], very few alternative loss functions have
been explored in deep DML. On the other hand, conventional DML approaches
like [6,11,16,28] offer a plethora of ways to formulate a DML loss function. In
fact, the recent work in [9] represented the last layer as a Symmetric Positive
Definite (SPD) matrix following a conventional metric learning approach. This
SPD matrix is jointly learned with a PCA projection matrix following a Riemannian
optimization framework, along with the parameters of the network. These
factors motivate us to revisit traditional SS-DML approaches utilizing different
criteria to formulate the DML loss function. Once such a parametric matrix is
found, it can be easily incorporated within a deep framework as in [9].
Traditional DML refers to the problem of learning a Mahalanobis-like
distance: $d_A^2(x, y) = (x - y)^\top A A^\top (x - y) = (x - y)^\top M (x - y) = d_M^2(x, y)$
for a pair of examples represented as feature vectors $x, y \in \mathbb{R}^d$ (which may have
been obtained using a convolutional neural network). Here, $A^\top : \mathbb{R}^d \to \mathbb{R}^l$ is a
linear mapping such that $M = AA^\top$, and $M \succeq 0$ is a symmetric Positive
Semi-Definite (PSD) parametric matrix to be learned. Equivalently we can learn $A$.
Any SS-DML approach can be formulated as the following general optimization
problem:

$$\min_{M \succeq 0} f_1(M, \mathcal{S}, \mathcal{D}) + f_2(M, \mathcal{X}) \tag{1}$$

$\mathcal{S}$ and $\mathcal{D}$ are the sets of must-link (similarity) and cannot-link (dissimilarity)
constraints respectively. They provide prior side-information (weak-supervision).
$f_1$ is a function of the weak-supervision, and $f_2$ is a function of the given
dataset $\mathcal{X}$, which also includes the unlabeled data $\mathcal{X}_U$. A majority of the SS-DML
approaches like [2,14,17,27] use the unlabeled data by expressing $f_2$ as
the Laplacian regularizer [13], which aims at preserving the topology of the data
via a graph constructed using the neighborhood relationships among the examples.
However, an important observation is that in most of these approaches,
$f_1$ is always a variation of the min-max principle: minimizing (maximizing) the
distances between the data points with must-link (cannot-link) constraints. By
considering different criteria to choose $f_1$, we can achieve our goal of formulating
an alternative DML loss function as discussed above.
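
The equivalence between the Mahalanobis-like form with $M = AA^\top$ and the squared Euclidean distance after the linear mapping $A^\top$ can be checked numerically. The sketch below uses random toy values; nothing in it is specific to the proposed method.

```python
import numpy as np

rng = np.random.default_rng(4)
d, l = 6, 3
A = rng.normal(size=(d, l))      # A^T maps R^d to R^l
M = A @ A.T                      # PSD by construction
x, y = rng.normal(size=d), rng.normal(size=d)

dist_M = (x - y) @ M @ (x - y)                 # (x - y)^T M (x - y)
dist_A = np.sum((A.T @ x - A.T @ y) ** 2)      # squared distance after projection
min_eig = np.linalg.eigvalsh(M).min()          # smallest eigenvalue of M
```

The two quantities agree up to rounding, and the smallest eigenvalue of `M` is non-negative up to rounding, which is why learning `A` removes the need to enforce the PSD constraint explicitly.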

This paper addresses the problem of formulating SS-DML approaches by
expressing $f_1$ in terms of prior metrics learned using different criteria, apart
from the min-max principle. Another important aspect of the existing SS-DML
approaches like [2,14,27] is that the Laplacian regularizer is computed
using an affinity matrix based on neighborhood relationships among the data
alone, without considering the pairwise constraints, which could provide further
information. This paper attempts to overcome these limitations. The major
contributions of this paper are as follows: (i) To make use of the pairwise constraints
in the Laplacian regularizer as well, we follow the affinity propagation principle
[17] and propose a general, topology-preserving SS-DML framework; (ii) The
framework enables a closed-form solution to learn a new metric by constraining
it to be close to a prior metric in a simple way; (iii) Different choices for the
prior metric are discussed to facilitate the formulation of new SS-DML
approaches by expressing $f_1$ with alternatives to the min-max principle.


2 Proposed Semi-supervised DML Framework 


Let $X = [x_1 \dots x_N] \in \mathbb{R}^{d \times N}$ denote the matrix containing the $N$ examples of a
dataset $\mathcal{X}$ as its columns. Let the two sets of pairwise constraints be
$\mathcal{S} = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are similar}\}$ and
$\mathcal{D} = \{(x_i, x_j) : x_i \text{ and } x_j \text{ are dissimilar}\}$.
Let $y_i = A^\top x_i$. The goal is to find $M \in \mathbb{R}^{d \times d}$ (or $A \in \mathbb{R}^{d \times l}$), $M \succeq 0$, using
the information provided in $\mathcal{S}$ and $\mathcal{D}$, such that $d_M^2(x_i, x_j) = d_A^2(x_i, x_j) =
\|y_i - y_j\|_2^2$. Our proposed framework can be expressed as:

$$\min_{M \succeq 0} \|M - M_0\|_F^2 + \beta\, \mathrm{tr}(M X L X^\top) \tag{2}$$

where $\beta > 0$ is a trade-off parameter and $\|Q\|_F^2 = \sum_{ij} Q_{ij}^2 = \mathrm{tr}(QQ^\top)$ is the
squared Frobenius norm of a matrix $Q$. The first term in (2) can be any function
$f_1(M, \mathcal{S}, \mathcal{D})$ of the weak-supervision. The main advantage of using this expression
is that it enables us to arrive at a closed-form solution to the SS-DML problem.
The goal is to learn the required metric $M \in \mathbb{R}^{d \times d}$ in such a way that it is
close to a prior metric $M_0$ (defined a priori or precomputed using $\mathcal{S}$ and $\mathcal{D}$).
One may argue for setting $f_1$ using the log-determinant divergence: $f_1(M, \mathcal{S}, \mathcal{D}) =
D_{ld}(M, M_0) = \mathrm{tr}(M M_0^{-1}) - \log |M M_0^{-1}| - d$. Although it leads to a convex
formulation, the solution is non-trivial, and would require a method like Bregman
projection [6]. Furthermore, the computation of $M^{-1}$ required in computing the
gradient of $D_{ld}(M, M_0)$ is hard. Another reason is that while in theory the
log-det term ensures that the optimum is within the PSD cone $\mathbb{S}_+^d$, the intermediate
iterates are not necessarily confined to the cone in practice [1].
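
For reference, the log-determinant divergence mentioned above can be evaluated as in the sketch below, which assumes both matrices are positive definite. The function name and the toy matrices are illustrative; the explicit inverse in the computation is the kind of operation the text describes as costly.

```python
import numpy as np

def logdet_divergence(M, M0):
    """D_ld(M, M0) = tr(M M0^{-1}) - log det(M M0^{-1}) - d (positive definite inputs)."""
    d = M.shape[0]
    P = M @ np.linalg.inv(M0)
    _, logabsdet = np.linalg.slogdet(P)   # numerically stable log-determinant
    return np.trace(P) - logabsdet - d

rng = np.random.default_rng(5)
B = rng.normal(size=(4, 4))
M0 = B @ B.T + np.eye(4)         # a positive definite prior metric (toy values)
M = M0 + 0.5 * np.eye(4)         # a nearby metric
```

The divergence is zero when `M` equals `M0` and positive otherwise, so it plays the same role as the squared Frobenius term in (2), at a higher computational cost.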

The second term in (2) is the Laplacian regularizer, which represents
the manifold structure of the data by a graph [13] constructed using the
relationships among the data in a Euclidean space, and is defined as:

$$\mathrm{tr}(M X L X^\top) = \mathrm{tr}(A^\top X L X^\top A) = \frac{1}{2} \sum_{i,j=1}^{N} \|y_i - y_j\|_2^2 W_{ij} \tag{3}$$


where $L = D - W$ is the graph Laplacian and $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{N} W_{ij}$, denoting the degree of a node in the neighborhood graph. The affinity/weight $W_{ij}$ represents a measure of similarity between two nodes $i$ and $j$ in the graph, representing examples $x_i$ and $x_j$ respectively. One may use the heat kernel [13] to set $W_{ij} = e^{-\|x_i - x_j\|^{2}/t}$ if $x_i \in N_j$ or $x_j \in N_i$, and $W_{ij} = 0$ otherwise. Here $t$ is a scale parameter and $N_i$ is the set of $k$ nearest neighbors of $x_i$, computed using pairwise Euclidean distances for constructing the graph. However, such an assignment of affinity does not make use of the pairwise constraints in $S$ and $D$, which can provide further information about the proximity of unlabeled examples (an unlabeled example is one not associated with any pairwise constraint). It is desirable to have a mechanism for the flow of information from the sets $S$ and $D$ to an unlabeled example. For this purpose, an adaptation of the Affinity Propagation (AP) procedure [17] is considered.

Define an initial affinity matrix $W^{0} \in \mathbb{R}^{N \times N}$. Assign $W^{0}_{ij} = +1$ if $i \neq j$ and either $(x_i, x_j) \in S$ or $(x_j, x_i) \in S$ or both. Assign $W^{0}_{ij} = -1$ if $i \neq j$ and either $(x_i, x_j) \in D$ or $(x_j, x_i) \in D$ or both. Note that the symmetry in affinities is intuitive, and useful in practice as well. Assign $W^{0}_{ii} = +1$, and $W^{0}_{ij} = 0$ for $(x_i, x_j) \notin S$ and $(x_i, x_j) \notin D$. Define a neighborhood indicator matrix $P \in \mathbb{R}^{N \times N}$ as follows: assign $P_{ij} = 1/k$ if $x_j \in N_i$, and $P_{ij} = 0$ otherwise, where $k$ is the number of nearest neighbors under consideration. Note that $P$ is asymmetric. Now, the goal is to propagate the affinities from the entries corresponding to the sets $S$ and $D$ to the 0-entries of $W^{0}$, using the neighborhood structure information provided by the matrix $P$. This can be achieved by following a Markov random walk: $W^{t+1} = (1-\alpha) W^{0} + \alpha P W^{t}$, where $\alpha$ is a trade-off parameter. As $0 < \alpha < 1$ and the eigenvalues of $P$ are in $[-1, 1]$, the limit $W^{*} = \lim_{t \to \infty} W^{t}$ exists [17], and can be expressed as: 


$$W^{*} = (1-\alpha)(I - \alpha P)^{-1} W^{0} \qquad (4)$$


The matrix $I - \alpha P$ is usually sparse in practice. (4) can also be solved as a linear system using the conjugate gradient method. For large scale computations, the k-NN graph can be efficiently approximated by the method in [7], which is orders of magnitude faster without any effect on performance. The final symmetric affinity matrix is obtained as $W_{ij} = (W^{*}_{ij} + W^{*}_{ji})/2$, and is used in (3). 
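The propagation step above can be illustrated with a short NumPy sketch. It is only a sketch under stated assumptions: examples are stored one per row, constraints are given as index pairs, the function names (`knn_indicator`, `initial_affinity`, `propagate_affinity`, `symmetrize`) are hypothetical, and a dense direct solve stands in for the sparse or conjugate gradient solvers mentioned in the text.

```python
import numpy as np

def knn_indicator(points, k):
    """P[i, j] = 1/k if x_j is among the k nearest neighbours of x_i, else 0.

    Assumption of this sketch: `points` has shape (N, d), one example per row.
    """
    sq = np.sum(points ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbour
    nbrs = np.argsort(dist, axis=1)[:, :k]
    n = points.shape[0]
    P = np.zeros((n, n))
    P[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0 / k
    return P

def initial_affinity(n, similar, dissimilar):
    """W0: +1 on the diagonal and for pairs in S, -1 for pairs in D, 0 elsewhere."""
    W0 = np.eye(n)
    for i, j in similar:
        if i != j:
            W0[i, j] = W0[j, i] = 1.0
    for i, j in dissimilar:
        if i != j:
            W0[i, j] = W0[j, i] = -1.0
    return W0

def propagate_affinity(W0, P, alpha=0.5):
    """Closed form (4): W* = (1 - alpha) (I - alpha P)^{-1} W0."""
    n = W0.shape[0]
    return (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * P, W0)

def symmetrize(W_star):
    """Final affinity: W_ij = (W*_ij + W*_ji) / 2."""
    return 0.5 * (W_star + W_star.T)
```

Iterating $W^{t+1} = (1-\alpha)W^{0} + \alpha P W^{t}$ from any starting matrix should approach the same `W_star`, which is a convenient sanity check for an implementation.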

Directly optimizing (2) in terms of $M$ requires maintaining the PSD constraint, which involves a computationally expensive projection onto the PSD cone $\mathbb{S}_{+}^{d}$ after every gradient step. Therefore, we consider the following optimization problem: 


$$\min_{A} \; \|A - A_0\|_F^{2} + \beta \operatorname{tr}(A^{T} X L X^{T} A) \qquad (5)$$


where $M_0 = A_0 A_0^{T}$. Though we drop the convexity in (5), recent studies [3,4] show that non-convex problems like this indeed work very well in practice and facilitate scalability. The advantage of using the formulation in (5) is two-fold: (1) It eliminates the need for maintaining the PSD constraint, as the final matrix $M = AA^{T}$ will be PSD by virtue of construction. (2) It can be solved using a simple closed-form solution. Setting the gradient of the objective function in (5) to zero leads to the following closed-form solution for $A$: 


$$A = [I_d + \beta X L X^{T}]^{-1} A_0 \qquad (6)$$

where $I_d$ is the $d \times d$ identity matrix. We refer to this proposed general framework 
as Affinity Propagation based Semi-Supervised Metric Learning (APSSML). 
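As a minimal sketch of the closed-form step, the following NumPy function solves the linear system implied by a zero gradient, $2(A - A_0) + 2\beta X L X^{T} A = 0$. The function name `apssml_closed_form`, the argument layout (data matrix with one example per column), and the symbol `beta` for the trade-off weight are assumptions of this illustration, not the authors' code. The sketch also assumes that $I_d + \beta X L X^{T}$ is non-singular.

```python
import numpy as np

def apssml_closed_form(X, W, A0, beta):
    """Return A = (I_d + beta X L X^T)^{-1} A0 and the metric M = A A^T.

    X  : (d, N) data matrix, one example per column (assumed layout)
    W  : (N, N) symmetric affinity matrix
    A0 : (d, r) factor of the prior metric, M0 = A0 A0^T
    """
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W
    d = X.shape[0]
    A = np.linalg.solve(np.eye(d) + beta * X @ L @ X.T, A0)
    return A, A @ A.T                         # M is PSD by construction
```

With `A0 = np.eye(d)` this corresponds to using the identity matrix as the prior metric; any other prior metric would first be factorized as $M_0 = A_0 A_0^{T}$.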


3 Choices for the Prior Metric 


Based on the choice of the prior metric, we can define a family of related SS-DML 
approaches: 

(i) Log-likelihood ratio based prior metric: The prior metric $M_0$ is computed from a statistical inference perspective, using the Keep It Simple and Straightforward MEtric (KISSME) learning approach [16]. It considers the space of pairwise differences and computes a log-likelihood ratio between two multivariate Gaussians to learn the metric. The SS-DML approach using this prior metric is called Affinity Propagation and Log-Likelihood Ratio (APLLR) based SS-DML. The motivation behind choosing the prior metric following [16] is its effectiveness and simplicity, while being orders of magnitude faster than other DML approaches. 

(ii) Identity matrix as prior metric: A naive way of defining $M_0$ is to set it to the identity matrix $I_d$, which avoids the need to compute a prior metric using a learning method. This can be done in applications where time is a constraint. The resulting approach is called Affinity Propagation and IDentity matrix (APID) based SS-DML. Despite its simplicity, the APID approach performs reasonably well, as observed later. 

(iii) Information-theoretic prior metric: The prior metric can also be obtained from the Information-Theoretic Metric Learning (ITML) [6] approach, which aims at minimizing the Kullback-Leibler (KL) divergence between an initial Gaussian distribution and the distribution parameterized by the learned metric. The resulting approach is called Affinity Propagation and Information-Theoretic (APIT) SS-DML. 


4 Experimental Studies 


The proposed approaches APLLR, APID and APIT are compared with the fol- 
lowing baselines: the recently proposed state-of-the-art Geometric Mean Met- 
ric Learning (GMML) [28], two SS-DML approaches: Laplacian Regularized 
Metric Learning (LRML) [14] and SEmi-supervised metRic leArning Paradigm 
with Hyper-sparsity (SERAPH) [19]. SERAPH follows entropy regularization 
instead of preserving the topological structure. Hyperparameters of the baseline approaches have been tuned to yield the best performance. For the proposed approaches, the number of neighbors for computing the graph Laplacian is set between 6 and 20. $\alpha$ is mostly kept at 0.5 or 0.6, as the performance is mostly insensitive to its value; however, we occasionally set it to 0.9 or 0.1. All other 
parameters related to the APSSML framework are empirically tuned in the range 
{1077, ...,10?}. 

Using both hand-crafted as well as deep features, experiments have been con- 
ducted on a variety of machine learning tasks: (i) Classification on benchmark 
UCI datasets (iris, wine, balance, diabetes and breast cancer), (ii) Handwritten 
digit recognition on the USPS dataset, (iii) Fine-grained visual categorization 
on the Caltech-UCSD Birds-200-2011 dataset (CUB) [24], (iv) Person 
re-identification on the VIPeR dataset [10], and (v) Image retrieval on the 
NUS-WIDE dataset [5]. The UCI datasets are split into 70%-15%-15% ratio 
for training-validation-testing with data normalization. Only 10% of the train- 
ing data is considered as labeled for both UCI and USPS datasets. For the CUB 
dataset, the ResNet [12] features given in [26] have been used. The train-val 
classes as given in [26] have been used for learning the parametric matrix. Only 
30% of the data from each of 150 training classes is considered as labeled. Per- 
formance on the test-seen data has been reported here. The approaches under 
consideration have been compared using the classification accuracy based on 
1-NN classifier (in %) for the UCI, USPS and CUB datasets. 

For the VIPeR dataset, the experimental protocol followed and features used 
are the same as in [16], while revealing (dis)similarity information of only 20% 
random pairs. The matching rate at rank 1 (in %) is used as the performance 
measure. A subset of 11 concepts has been chosen from the NUS-WIDE dataset and represented by the normalized CM55 features [5]. The data has been organized in 10 disjoint folds, such that for each fold a subset of 2200 images is selected in such a way that each concept has 200 relevant images and 200 irrelevant images from each of the other 10 concepts. From this subset, 10% of the examples have been chosen to generate the pairwise constraints. Testing is done on a subset of the dataset with 500 images for each concept that have not been seen during training. Mean Average Precision (AP) across all concepts based on the top 10 retrieved images (in %) is used as the performance measure. 


[Figure 1: bar chart. x-axis: Dataset (Iris, Wine, Bal, Diab, BCan, USPS, CUB, VIPeR, NUS); y-axis: Performance measure (in %), from 0 to 100.] 


Fig. 1. Performance measures (in %, higher the better) obtained using distance metrics 
learned by different approaches across different tasks. 
Table 1. Average ranks (lower the better) along with standard deviation of the compared approaches across all the tasks as shown in Fig. 1. 

Group | Approach | Avg. rank ± std
Baseline, state-of-the-art | GMML | 2.33 ± 1.58
Baseline, semi-supervised | LRML | 4.44 ± 1.42
Baseline, semi-supervised | SERAPH | 4.77 ± 1.20
Proposed semi-supervised | APLLR | 3.22 ± 2.27
Proposed semi-supervised | APID | 3.77 ± 1.30
Proposed semi-supervised | APIT | 2.44 ± 1.01


The performance measures discussed above, obtained for all the studied tasks using distance metrics learned by the different approaches, are shown collectively in Fig. 1. The average ranks of the compared approaches across all the tasks/datasets, along with their standard deviations, are shown in Table 1. As seen in Table 1, the proposed SS-DML approaches APLLR, APID and APIT achieve better average ranks than the baseline SS-DML approaches LRML and SERAPH. This highlights the importance of considering an alternative formulation to encode the weak supervision by following criteria other than the min-max principle alone. The proposed approaches are also competitive with the state-of-the-art GMML approach (average rank 2.44 for APIT versus 2.33 for GMML). 

In fact, the proposed APLLR obtains the best performance on the following datasets: iris, balance, diabetes and CUB. However, the APLLR approach is also less stable: its rank has the largest standard deviation in Table 1. This is because of the underlying prior metric obtained using the KISSME approach. It is not surprising because, despite its success, KISSME requires careful preprocessing and denoising. The invertibility of the scatter matrix of the similar pairs involved in KISSME also plays a crucial role, and is dataset dependent. On the other hand, the APIT approach is much more stable and consistent. This can be attributed to the regularizer present in the ITML approach. It is noteworthy that, despite its simplicity, APID does perform well. This shows that propagating information from pairwise constraints to the unlabeled data does help, though not significantly in some cases. We believe that thresholding out smaller values in the affinity matrix, or considering only the top affinities for each element, may help reduce noise and further improve performance. It should be noted that the choice of the prior metric plays a pivotal role. 

In order to specifically study the relative improvement obtained by the affinity propagation alone, the performances of the proposed approaches APLLR, APID and APIT have also been compared with those of their prior metrics, obtained by KISSME, EUC (identity matrix as the prior metric) and ITML respectively. The comparison can be seen in Fig. 2(a), (c) and (e). In most cases, the proposed approaches improve over the prior metrics, again highlighting the importance of propagating the information from pairwise constraints to the unlabeled data. However, for the image retrieval task on the NUS-WIDE dataset we observed otherwise: the proposed approaches performed worse than the prior metrics. Hence, we studied the comparative performance of the proposed approaches and the prior metrics on individual concepts of the NUS-WIDE dataset. Performances (AP, in %) on five of the eleven concepts are shown in Fig. 2(b), (d) and (f). Although the proposed approaches performed better on four concepts (sky, ocean, clouds and animal), all of them except APLLR performed worse on the sunset concept. For the remaining concepts (buildings, grass, lake, person, plants and reflection) we did not observe any improvement either. We suspect that adding unlabeled data for these concepts was not beneficial. In such cases it is advisable to simply apply an approach like ITML or GMML. The degradation may be due to the multi-concept nature of the dataset and the inability of the SS-DML approaches to unravel the manifold structure of the data for some of the concepts, which lowers the average performance on the NUS-WIDE dataset. 

[Figure 2: six horizontal bar charts, each with x-axis "Performance measure (in %)". Panels: (a) APID vs prior metric of EUC; (b) APID vs prior metric of EUC; (c) APIT vs prior metric of ITML; (d) APIT vs prior metric of ITML; (e) APLLR vs prior metric of KISSME; (f) APLLR vs prior metric of KISSME. Panels (a), (c) and (e) compare across datasets; panels (b), (d) and (f) compare on individual NUS-WIDE concepts.] 

Fig. 2. Comparison of performance measures (in %, higher the better) obtained for the 
proposed approaches with that of the prior metrics. 

It should be noted that, as an alternative to the two-stage nature of the APSSML 
framework, jointly learning the prior metric $M_0$ and the current metric $M$ using 
an alternating optimization scheme could be explored as future work. 


5 Conclusions 


A general affinity propagation based topology-preserving semi-supervised DML 
framework has been proposed. By constraining the learned metric to stay close to 
a prior metric with respect to the squared Frobenius norm, a closed-form solution 
for the framework has been derived. Different choices for the prior metric have 
been discussed, resulting in new semi-supervised DML approaches which have 
shown competitive performance. 
Abstract. Prediction intervals offer a means of assessing the uncertainty 
of artificial neural networks’ point predictions. In this work, we 
propose a hybrid approach for constructing prediction intervals, com- 
bining the Bootstrap method with a direct approximation of lower and 
upper error bounds. The main objective is to construct high-quality pre- 
diction intervals — combining high coverage probability for future obser- 
vations with small and thus informative interval widths — even when 
sparse data is available. The approach is extended to adaptive approx- 
imation, whereby an online learning scheme is proposed to iteratively 
update prediction intervals based on recent measurements, requiring 
a reduced computational cost compared to offline approximation. Our 
results suggest the potential of the hybrid approach to construct high- 
coverage prediction intervals, in batch and online approximation, even 
when data quantity and density are limited. Furthermore, they highlight 
the need for cautious use and evaluation of the training data to be used 
for estimating prediction intervals. 


Keywords: Prediction intervals · Lower and upper error bounds · Online learning · Adaptive approximation 


1 Introduction 


The use of Artificial Neural Networks (ANN) in approximating unknown func- 
tions has attracted significant research interest over the last decades [1,2], moti- 
vated by the universal approximator properties of ANN [2]. However, in practical 
scenarios where the function to be approximated is unknown, ANN’s accuracy 
relies on the quality and quantity of the available measurements. Noise-corrupted 
measurements, multi-valued targets along with data uncertainty stemming from 
variabilities of the physical system, significantly impact ANN’s point predictions. 
The reliability of point predictions is further deteriorated in online approxima- 
tion scenarios, whereby the training data might be sparse — especially at initial 
training stages — or might not representatively cover the entire region of interest. 


© Springer Nature Switzerland AG 2018 
Such issues will likely force the ANN to extrapolate, limiting its generalisation 
ability along with the practical utility of point predictions. As an alternative 
to point predictions, Prediction Intervals (PIs) have been proposed [3-5] which 
provide lower and upper bounds for a future observation, with a prescribed 
probability. From a practical point of view, PIs could be preferable to point pre- 
dictions as they provide an indication of the reliability of the ANN as well as 
enable practitioners to consider best- and worst-case scenarios. For example, PIs 
could be particularly useful in control engineering and fault detection applica- 
tions [6], where uncertainty bounds could help distinguish the healthy operation 
of the system from faulty behaviour. 

A range of methods have been proposed in the literature for constructing PIs 
and assessing the reliability of ANN. Amongst them, the delta technique [3], the 
mean variance estimation method and Bootstrap approaches [4] have been used 
extensively to evaluate PIs on real and synthetic problems. These traditional 
approaches first generate the point predictions and subsequently compute the 
PIs following assumptions on error or data distributions, which might be invalid 
in real world applications. Additionally, as the resulting PIs are not constructed 
to optimise PI quality, they might suffer from low coverage of the training/test 
set or might result in wide, over-conservative error bounds. 

An alternative approach (Lower Upper Bound Estimation (LUBE)) has been 
proposed by Khosravi et al., focusing on directly estimating high-quality PIs, 
while avoiding restrictive assumptions on error distributions [5]. Instead of quan- 
tifying the error of point predictions, LUBE uses ANN to directly approxi- 
mate lower and upper error bounds, by optimising model coefficients to achieve 
maximum coverage of available measurements, with the minimum PI width 
[5,7]. Although LUBE has demonstrated significant potential against traditional 
approaches in terms of accuracy, interval width and computational cost [8,9], it 
is less reliable when limited or non-uniformly distributed training data are avail- 
able [10]. In fact, Bootstrap and delta methods produce wider PIs in regions 
with sparse data, signifying the larger level of uncertainty in ANN approxima- 
tion; capturing model uncertainty is an important feature of PIs [9,11], lacking 
in the LUBE approach which mainly accounts for noise variance. 

In this work, we propose a combination of the Bootstrap and LUBE meth- 
ods, which exploits good characteristics from both techniques. The proposed 
Bootstrap-LUBE Method (BLM) enhances the reliability of the LUBE approach 
when data is sparse or limited, by augmenting the training set with pseudo- 
measurements stemming from Bootstrap replications. The pseudo-measurements 
will present larger variability in regions with sparse data, forcing BLM to pro- 
duce a wider local PI and thus capture the larger uncertainty in approximation. 
Following LUBE, BLM constructs PIs by optimising their coverage and width, 
while at the same time avoiding any assumptions on data/error distributions. 

Another important contribution of this work is to extend the proposed hybrid 
approach to adaptively approximate the PIs during the online operation of the 
system. In cases where data becomes continuously available in a sequential way, 
use of either LUBE or BLM on the entire current dataset would become 
infeasible as it would incur a continuously increasing computational cost. At the 
same time, offline estimation of PIs based on past data would likely be unsuitable 
as it would be unable to accommodate dynamic changes in data patterns. We 
propose an online learning scheme for estimating PIs, in which the lower and 
upper bounds are iteratively updated to also account for recent measurements. 
At each iteration only recent data are used in PI optimisation, thus significantly 
reducing the computational cost and further enhancing the efficiency of BLM. 


2 Methods 


Throughout this section, we assume that we want to construct a PI for the approximation of an unknown function $f(x)$, $x \in D$, where the region of interest $D$ is a compact subset of $\mathbb{R}$. Available measurements are denoted by $(x_i, Y_i)$, $i = 1, \dots, N$, which are assumed to be corrupted by noise $\epsilon$ ($Y_i = f(x_i) + \epsilon_i$). A PI of a predetermined confidence level $(1 - \alpha)$ for a future observation $Y_{N+1}$ consists of a lower bound $L(x_{N+1})$ and an upper bound $U(x_{N+1})$, denoting that the future observation will lie within the interval with probability $1 - \alpha$: 


$$P\big(Y_{N+1} \in [L(x_{N+1}), U(x_{N+1})]\big) = 1 - \alpha. \qquad (1)$$


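For the Bootstrap method, let us assume that we want to approximate the unknown function $f(x)$ with $\hat{f}(x; w, c, \sigma)$, using a Radial Basis Function (RBF) network: 

$$\hat{f}(x; w, c, \sigma) = \sum_{h=1}^{H} w_h \phi_h(x; c_h, \sigma_h), \qquad \phi_h(x; c_h, \sigma_h) = \exp\left(-\frac{(x - c_h)^{2}}{\sigma_h^{2}}\right) \qquad (2)$$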


Here $H$ denotes the number of ANN neurons ($H = 20$ for the tests considered) and $w_h$ are weighting coefficients scaling the RBFs $\phi_h$. The centres $c_h$ are evenly distributed over the region of interest and the widths $\sigma_h$ are evaluated using a nearest-neighbour heuristic, leading to a linear-in-parameter approximator $\hat{f}(x; w)$. The weight vector $w$ can then be estimated by minimising the error function $\sum_{i=1}^{N} [Y_i - \hat{f}(x_i; w)]^{2}$ using least squares estimation. 
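To make the linear-in-parameter structure concrete, here is a minimal NumPy sketch of the RBF approximator and its least squares fit. Several details are assumptions of this illustration: the Gaussian basis is taken as $\exp(-(x - c_h)^2/\sigma_h^2)$, the nearest-neighbour heuristic is taken to mean that each width equals the distance to the closest other centre, and the function names are hypothetical.

```python
import numpy as np

def make_rbf(lo, hi, n_neurons=20):
    """Evenly spaced centres on [lo, hi] and nearest-neighbour widths."""
    centres = np.linspace(lo, hi, n_neurons)
    spacing = np.abs(centres[:, None] - centres[None, :])
    np.fill_diagonal(spacing, np.inf)         # ignore distance to itself
    widths = spacing.min(axis=1)              # assumed variant of the heuristic
    return centres, widths

def rbf_design(x, centres, widths):
    """Design matrix Phi with Phi[i, h] = exp(-(x_i - c_h)^2 / sigma_h^2)."""
    diff = x[:, None] - centres[None, :]
    return np.exp(-(diff ** 2) / widths[None, :] ** 2)

def fit_weights(x, y, centres, widths):
    """Least squares estimate of w minimising sum_i (y_i - Phi[i] @ w)^2."""
    Phi = rbf_design(x, centres, widths)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w
```

Because the centres and widths are fixed in advance, fitting reduces to an ordinary linear least squares problem in $w$, which is what makes repeated refits in the Bootstrap procedure cheap.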


2.1 Prediction Interval Estimation Methods 


Bootstrap Residual Method. Bootstrap methods rely on multiple pseudo- 
replications of the training set to approximate unbiased estimates of prediction 
errors. Here we concentrate on the Bootstrap residual method, whereby model 
residuals are randomly resampled with replacement. The Bootstrap residual 
method algorithm described in [4] can be summarised as follows: 


- Get an initial estimate $\hat{w}$ from the available measurements, compute residuals $r_i = Y_i - \hat{f}(x_i; \hat{w})$ and then compute variance-corrected residuals $s_i$ [4]. 
- Generate $B$ samples of size $N$ drawn with replacement from the residuals $s_1, \dots, s_N$, denoted by $s^{b}_1, \dots, s^{b}_N$ for the $b$-th sample. For the $b$-th replication: 
  • Generate the $b$-th replication's "measurements" $Y^{b}_i = \hat{f}(x_i; \hat{w}) + s^{b}_i$. 
  • Estimate $w_b$ by minimising the error $\sum_{i=1}^{N} [Y^{b}_i - \hat{f}(x_i; w)]^{2}$ and calculate the Bootstrap approximation $\hat{f}^{b}(x; w_b)$. 
  • Calculate the current estimate for the approximation error $e^{b}_{N+1}$, as described in [4]. 
- Construct the PI using percentiles of the error $e_{N+1}$. 

Fig. 1. Function approximations $\hat{f}^{b}$ at 50 Bootstrap replications (grey shaded lines). 
The variability among approximations from different replications is significantly larger 
in regions where measurements used for training (red circles) are limited. (Color figure 
online) 
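The Bootstrap residual steps listed above can be sketched as follows for any linear-in-parameter model with design matrix `Phi`. This is an illustration under explicit assumptions rather than the exact procedure of [4]: the variance correction is taken to be the leverage adjustment $r_i/\sqrt{1 - h_{ii}}$ followed by centring, and the approximation error is taken to be the refitted prediction minus the original prediction minus a freshly resampled residual. The function name and arguments are hypothetical.

```python
import numpy as np

def bootstrap_residual_pi(Phi, y, Phi_new, n_boot=200, alpha=0.05, seed=0):
    """Percentile prediction interval from residual resampling (sketch)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    fitted = Phi @ w_hat
    resid = y - fitted
    # Assumed variance correction: leverage-adjusted, centred residuals.
    leverage = np.sum(Phi * np.linalg.pinv(Phi).T, axis=1)
    s = resid / np.sqrt(np.clip(1.0 - leverage, 1e-8, None))
    s = s - s.mean()
    pred = Phi_new @ w_hat
    errors = np.empty((n_boot, Phi_new.shape[0]))
    for b in range(n_boot):
        # Replication "measurements": original fit plus resampled residuals.
        y_b = fitted + rng.choice(s, size=n, replace=True)
        w_b, *_ = np.linalg.lstsq(Phi, y_b, rcond=None)
        # Assumed error definition: refit error minus a new noise draw.
        noise = rng.choice(s, size=Phi_new.shape[0], replace=True)
        errors[b] = Phi_new @ w_b - pred - noise
    lo_q, hi_q = np.percentile(
        errors, [100 * alpha / 2, 100 * (1 - alpha / 2)], axis=0)
    return pred - hi_q, pred - lo_q           # lower and upper bounds
```

In regions where the refits `w_b` disagree strongly, the spread of `errors` grows and the interval widens, which is the behaviour Fig. 1 attributes to sparse training data.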


LUBE Method. LUBE's cornerstone is the direct approximation of PIs using ANNs. Instead of the unknown function $f(x)$, LUBE approximates the lower $L(x)$ and upper $U(x)$ bounds using RBFs: $\hat{L}(x; w^{L}) = \sum_{h=1}^{H} w^{L}_h \phi_h(x)$, $\hat{U}(x; w^{U}) = \sum_{h=1}^{H} w^{U}_h \phi_h(x)$. The main goal is to produce high-quality PIs, where quality is assessed using two indices: (a) PI Coverage Probability (PICP) and (b) Normalised Mean Prediction Interval Width (NMPIW). In particular, PICP is given by: 


$$\mathrm{PICP}(w^{L}, w^{U}) = \frac{1}{N} \sum_{i=1}^{N} C_i \qquad (3)$$

with $C_i = 1$ if $Y_i \in [\hat{L}(x_i; w^{L}), \hat{U}(x_i; w^{U})]$ and $C_i = 0$ otherwise. Similarly, 
for $R$ denoting the range of observations, NMPIW is given by: 


$$\mathrm{NMPIW}(w^{L}, w^{U}) = \frac{1}{N} \sum_{i=1}^{N} \big[\hat{U}(x_i; w^{U}) - \hat{L}(x_i; w^{L})\big] / R. \qquad (4)$$
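The two indices in (3) and (4) translate directly into code. The sketch below assumes NumPy arrays of equal length for the observations and the two bounds, and takes $R$ to be the range (maximum minus minimum) of the observations passed in; the function names are illustrative.

```python
import numpy as np

def picp(y, lower, upper):
    """PI coverage probability: fraction of observations inside [lower, upper]."""
    return float(np.mean((y >= lower) & (y <= upper)))

def nmpiw(y, lower, upper):
    """Mean interval width normalised by the range R of the observations."""
    value_range = float(np.max(y) - np.min(y))
    return float(np.mean(upper - lower) / value_range)
```

For example, four observations with unit-width intervals, three of which contain their observation, give a PICP of 0.75, and if the observations span a range of 4 the NMPIW is 0.25.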
From a practical point of view it is useful to have narrow PIs (small NMPIW) 
which offer high coverage of the measurements (large PICP), leading to the 
following optimisation problem [5,7]: 
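$$\begin{aligned}
\text{Minimise} \quad & \mathrm{NMPIW}(w^{L}, w^{U}) & (5) \\
& 1 - \mathrm{PICP}(w^{L}, w^{U}) & (6) \\
\text{Subject to} \quad & \mathrm{NMPIW}(w^{L}, w^{U}) > 0, & (7) \\
& 1 - \mathrm{PICP}(w^{L}, w^{U}) \le \alpha, & (8)
\end{aligned}$$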
where $1 - \alpha$ is the desired confidence level ($\alpha = 0.05$ for the tests considered). Due to the complexity of the multi-objective optimisation problem, the weights $w^{L}$ and $w^{U}$ are estimated using the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [7,12]. Among solutions with PICP $\ge 1 - \alpha$, the solution producing the narrowest PI is selected. 
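The final selection rule can be written in a few lines. The sketch assumes the optimiser returns a list of candidate solutions, each represented here by a hypothetical dictionary holding its PICP and NMPIW values; the genetic algorithm itself is not shown.

```python
def select_solution(candidates, alpha=0.05):
    """Pick the narrowest PI among candidates with PICP >= 1 - alpha.

    `candidates` is an assumed format: a list of dicts with keys
    "picp" and "nmpiw" (extra keys, such as the weights, are allowed).
    Returns None when no candidate reaches the required coverage.
    """
    feasible = [c for c in candidates
                if c["picp"] >= 1.0 - alpha and c["nmpiw"] > 0.0]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["nmpiw"])
```

Coverage acts as a hard filter and width as the tie-breaker, so a very narrow interval that misses the coverage target is never chosen.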


Bootstrap-LUBE Method (BLM). BLM aims to combine good characteristics of the Bootstrap and LUBE methods. The main objective of BLM is to directly estimate PIs by optimising their quality (similar to LUBE), while at the same time accounting for model uncertainty (similar to Bootstrap). 

In fact, Bootstrap produces wider bounds in regions with sparse data, capturing the larger model uncertainty, while the LUBE approach, which mainly accounts for noise variance, lacks this feature (Figs. 1, 2 and 3). Looking closer into Bootstrap (Fig. 1), there is significant variability between the Bootstrap approximations $\hat{f}^{b}$ from different replications in regions with sparse data, most likely due to extrapolation. In such regions the error at each replication will be large, leading to large regional error variance and wide regional error bounds. 

The main idea of BLM is to enrich the $N$ available measurements with pseudo-measurements originating from the Bootstrap approximations ($\hat{f}^{b}$), to force BLM to account for data density. We first define an auxiliary set of points ($x^{*}_j$, $j = 1, \dots, N_{aux}$) evenly distributed in the region of interest. We then compute the Bootstrap approximation of each replication for all of the $x^{*}$ points ($\hat{f}^{b}(x^{*}_j)$; $b = 1, \dots, B$, $j = 1, \dots, N_{aux}$), which leads to $B \cdot N_{aux}$ pseudo-measurements (light blue dots in Figs. 2 and 3). The multi-objective optimisation problem of LUBE is now augmented to finding $w^{L}$ and $w^{U}$ which: 

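$$\begin{aligned}
\text{Minimise} \quad & \mathrm{NMPIW}(w^{L}, w^{U}) + \mathrm{NMPIW}_{pseudo}(w^{L}, w^{U}) & (9) \\
& 1 - \mathrm{PICP}(w^{L}, w^{U}) & (10) \\
\text{Subject to} \quad & \mathrm{NMPIW}(w^{L}, w^{U}) > 0, & (11) \\
& 1 - \mathrm{PICP}(w^{L}, w^{U}) \le \alpha, & (12) \\
& 1 - \mathrm{PICP}_{pseudo}(w^{L}, w^{U}) \le 0.01, & (13)
\end{aligned}$$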


where PICP and NMPIW are computed over the $N$ actual measurements, and $\mathrm{PICP}_{pseudo}$ and $\mathrm{NMPIW}_{pseudo}$ are computed on the pseudo-measurements. With the BLM formulation the PIs will be forced to be wider in regions with sparse data (where pseudo-measurements will present substantial variations), indicating larger model uncertainty. At the same time, regions with dense data will not be affected, as the variation in pseudo-measurements will be small (the Bootstrap approximation in those regions is similar throughout replications (Fig. 1)). 
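A hedged sketch of how the BLM objectives and constraints in (9) to (13) could be evaluated for one candidate pair of weight vectors is given below. It assumes linear-in-parameter bounds (design matrix times weights), normalises each NMPIW by the range of its own data set, and uses hypothetical function and argument names; the multi-objective search itself is left to an external optimiser.

```python
import numpy as np

def picp(y, lower, upper):
    return float(np.mean((y >= lower) & (y <= upper)))

def nmpiw(y, lower, upper):
    return float(np.mean(upper - lower) / (np.max(y) - np.min(y)))

def pseudo_measurements(boot_weights, phi_aux):
    """Evaluate every Bootstrap replication at every auxiliary point.

    boot_weights : (B, H) weights of the B Bootstrap approximations
    phi_aux      : (N_aux, H) design matrix at the auxiliary points
    Returns the stacked design matrix (B * N_aux, H) and matching values.
    """
    values = boot_weights @ phi_aux.T                  # shape (B, N_aux)
    phi_pseudo = np.tile(phi_aux, (boot_weights.shape[0], 1))
    return phi_pseudo, values.ravel()

def blm_evaluate(w_low, w_up, phi, y, phi_pseudo, y_pseudo,
                 alpha=0.05, pseudo_tol=0.01):
    """Return the two objectives (9)-(10) and the feasibility of (11)-(13)."""
    low, up = phi @ w_low, phi @ w_up
    low_p, up_p = phi_pseudo @ w_low, phi_pseudo @ w_up
    width = nmpiw(y, low, up)
    miss = 1.0 - picp(y, low, up)
    miss_pseudo = 1.0 - picp(y_pseudo, low_p, up_p)
    objectives = (width + nmpiw(y_pseudo, low_p, up_p), miss)
    feasible = width > 0.0 and miss <= alpha and miss_pseudo <= pseudo_tol
    return objectives, feasible
```

Since the pseudo-measurements only enter through extra PICP and NMPIW terms, the number of parameters to optimise is unchanged relative to LUBE.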


2.2 Online Estimation of Prediction Intervals 


During the online operation of a system where data becomes available in a 
sequential manner, use of either LUBE or BLM on the entire current dataset 
would become infeasible. To this end, we propose an online approximation 
scheme which takes into account past and current data, in a computationally 
efficient way. Based on a weighted sliding window learning scheme, the lower and 
upper bounds are iteratively updated at specific time instances. 

In particular, the lower and upper bounds’ weights (condensed into the vector $w$) are first trained on the $N_i$ initial measurements, leading to the estimate $w_i$. Assuming a continual and uniform-in-time inflow of measurements, the bounds are updated at the first sliding window, when $N_i + N_w$ measurements are available ($N_w < N_i$): 

$$w(N_i + N_w) = \frac{N_i}{N_i + N_w}\, w_i + \frac{N_w}{N_i + N_w}\, w_w \qquad (14)$$

Here $w_w$ denotes the weights of the lower and upper bounds estimated with multi-objective optimisation based only on the most recent $N_w$ measurements of the current window. The contribution of the recent measurements to the current weights’ evaluation is determined by the ratio of measurements in the current window ($N_w$) to the total number of available measurements ($N_i + N_w$). Similarly, for the $k$-th window, the weights are iteratively updated to account for past and current measurements with equal contributions: 

$$w(N_i + kN_w) = \frac{N_i + (k-1)N_w}{N_i + kN_w}\, w\big(N_i + (k-1)N_w\big) + \frac{N_w}{N_i + kN_w}\, w_w \qquad (15)$$

For each window only N,, measurements are used in the optimisation, signif- 
icantly reducing the computational cost of the optimisation problem. Note that 
when BLM is used, the weights are estimated using the measurements of the 
current window as well as the auxiliary Bootstrap-based measurements. 
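The weighted sliding window update can be sketched in a few lines of NumPy. The function name `online_update` and its return convention are assumptions of this illustration; the per-window weights would come from running the LUBE or BLM optimisation on the current window only.

```python
import numpy as np

def online_update(w_prev, w_window, n_prev, n_window):
    """One sliding-window step: blend old and new weights by sample counts.

    w_prev   : weights representing the n_prev measurements seen so far
    w_window : weights fitted on the n_window measurements of the new window
    Returns the updated weights and the new total number of measurements.
    """
    total = n_prev + n_window
    w_new = (n_prev * w_prev + n_window * w_window) / total
    return w_new, total
```

Applying this step repeatedly makes every measurement contribute equally: after $k$ windows the result equals $(N_i w_i + N_w \sum_{j=1}^{k} w_w^{(j)})/(N_i + kN_w)$, where $w_w^{(j)}$ denotes the weights fitted on window $j$.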


3 Results and Discussion 


3.1 Comparison of Prediction Interval Estimation Methods 


The methods for constructing PIs described in Sect. 2.1 are tested and compared 
on synthetic tests. Of interest in this work is the quality of the PIs when non- 
uniformly distributed or sparse data are available. Accordingly, as we are investi- 
gating extreme scenarios, the training data are generated from random uniformly 
distributed data under specific restrictions. In particular, we are replicating two 
scenarios: (a) the training data do not representatively cover the entire domain, 
but only regions of it (Fig. 2), (b) very few training data are available over the 
entire domain (Fig. 3). For both scenarios the test data are uniformly covering 
the entire domain, to enable reliable assessment of PI accuracy. 

Two functions to be approximated are considered ($f_1(x) = 0.5\sin(1.5\pi x + \pi/2) + 2$, $f_2(x) = 5\sin(\pi x + \pi/2) + \exp(x)$). Training and test data are generated based on these functions, and white Gaussian noise of 10% of the mean function value is added. For both functions we consider the two training scenarios, leading to the following tests: Test1: PI for regional data generated from $f_1$, Test2: PI for regional data generated from $f_2$, Test3: PI for sparse data generated from $f_1$, Test4: PI for sparse data generated from $f_2$. For Test1 and Test2, we consider 100 training points, while 15 training points are considered for Test3 and Test4. Every test is repeated 10 times with different randomly generated training data, to enable a more reliable comparison of the methods. Table 1 presents PI quality indices for all methods, averaged over the 10 replications of each test. Representative PIs are demonstrated in Fig. 2 for scenario (a) and Fig. 3 for scenario (b). 
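A minimal sketch of how such synthetic data could be generated is shown below. The two test functions follow the definitions above, but several details are assumptions of this illustration: the sampling regions passed in `regions` are hypothetical, and the noise standard deviation is taken to be 10% of the absolute mean function value over the span of those regions.

```python
import numpy as np

def f1(x):
    return 0.5 * np.sin(1.5 * np.pi * x + np.pi / 2) + 2.0

def f2(x):
    return 5.0 * np.sin(np.pi * x + np.pi / 2) + np.exp(x)

def make_dataset(func, n_points, regions, noise_fraction=0.1, seed=0):
    """Sample x uniformly inside the given regions and add Gaussian noise.

    regions : list of (low, high) intervals; one interval covering the whole
              domain mimics the sparse scenario, several disjoint intervals
              mimic the regional scenario (interval choices are hypothetical).
    """
    rng = np.random.default_rng(seed)
    regions = np.asarray(regions, dtype=float)
    idx = rng.integers(0, len(regions), size=n_points)   # region per point
    x = rng.uniform(regions[idx, 0], regions[idx, 1])
    # Assumed noise model: std = noise_fraction * |mean function value|.
    grid = np.linspace(regions.min(), regions.max(), 1000)
    sigma = noise_fraction * abs(func(grid).mean())
    y = func(x) + rng.normal(0.0, sigma, size=n_points)
    return x, y
```

For instance, `make_dataset(f1, 100, [(0.0, 0.3), (0.7, 1.0)])` would give 100 noisy points confined to two sub-intervals, while `make_dataset(f2, 15, [(0.0, 1.0)])` would give 15 points spread over a single interval.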


[Figure 2: six panels, each annotated with its PICP and NMPIW values. Legend: upper bound, lower bound, training dataset, test dataset, function.] 


Fig. 2. PIs constructed using the Bootstrap (left column), LUBE (middle column) and 
BLM (right column) approaches. Data limited to certain regions of the domain, following 
scenario (a), originate from $f_1$ (Test1, top row) and $f_2$ (Test2, bottom row). Light blue 
dots indicate Bootstrap pseudo-measurements used by BLM. PICP and NMPIW are 
evaluated on the test dataset, uniformly covering the entire domain. (Color figure 
online) 


Across the tests considered, BLM clearly outperforms the LUBE method in terms of coverage, with average PICP increases of roughly 15 to 30 percentage points (Table 1). By considering Bootstrap pseudo-measurements, BLM is able to produce larger bounds in regions with fewer data, providing an indication of the uncertainty in the estimation. Additionally, due to BLM’s optimisation of PI quality, BLM produces better coverage than Bootstrap in the majority of tests. Increased PICP comes at the cost of wider PIs; nevertheless, the fundamental requirement for a PI to reliably include future observations is clearly prioritised over narrow yet invalid PIs. 

Finally, it is worth noting that BLM is performed on a larger number of training measurements compared to LUBE, without significantly impacting the computational cost. The increased cost in computing PICP and NMPIW over the pseudo-measurements is not substantial (note that $B$ and $N_{aux}$ do not need to be very large to enable BLM to account for data density), while the dimensions of the parameters ($w^{L}$ and $w^{U}$) to be estimated remain the same. 

Fig. 3. PIs constructed using the Bootstrap (left column), LUBE (middle column) and 
BLM (right column) approaches. Sparse data originate from $f_1$ (Test3, top row) and $f_2$ 
(Test4, bottom row) following scenario (b). Light blue dots indicate Bootstrap pseudo- 
measurements used by BLM. PICP and NMPIW are evaluated on the test dataset, 
uniformly covering the entire domain. (Color figure online) 


Table 1. Average characteristics of the PIs constructed for four synthetic tests, using 
the Bootstrap, LUBE and BLM approaches. PICP and NMPIW are evaluated on the 
test dataset, uniformly covering the entire domain. 

Tests | Bootstrap PICP(%) | Bootstrap NMPIW(%) | LUBE PICP(%) | LUBE NMPIW(%) | BLM PICP(%) | BLM NMPIW(%)
Test1 | 83.23 | 43.54 | 74.33 | 37.68 | 89.75 | 57.63
Test2 | 65.06 | 20.74 | 62.47 | 18.61 | 91.76 | 41.00
Test3 | 65.27 | 33.56 | 66.67 | 29.72 | 94.68 | 62.68
Test4 | 65.72 | 10.77 | 64.97 | 9.23 | 90.30 | 24.93


3.2 Online Estimation of Prediction Intervals with LUBE and BLM 


The proposed online learning scheme (Eq. 15) is compared against batch (offline) estimation using both the LUBE and BLM approaches. Initially, LUBE is used with $N_i = 100$ initial training points and $N_w = 10$, subsequently with $N_i = 1000$ and $N_w = 100$, and finally BLM is used with $N_i = 100$ and $N_w = 10$. In all tests $k = 10$ sliding windows are considered, and each of the three cases is repeated 10 times. For batch approximation all $N_i + kN_w$ training points are used for PI optimisation. Representative results are presented in Fig. 4 and average PI indices in Table 2. 

Fig. 4. PIs constructed using batch (top row) and online (bottom row) estimation. 
Online PI estimation is tested using LUBE on $N_i = 100$ and $N_w = 10$ training points 
(left column), using LUBE on $N_i = 1000$ and $N_w = 100$ training points (middle 
column) and using BLM on $N_i = 100$ and $N_w = 10$ training points (right column). 

Table 2. Average characteristics of the PIs constructed using the LUBE and BLM 
methods, based on batch or online approximation. 

Estimation | LUBE (Nw = 10) PICP(%) | LUBE (Nw = 10) NMPIW(%) | LUBE (Nw = 100) PICP(%) | LUBE (Nw = 100) NMPIW(%) | BLM (Nw = 10) PICP(%) | BLM (Nw = 10) NMPIW(%)
Batch | 94.95 | 54.80 | 94.84 | 42.64 | 92.50 | 46.60
Online | 76.14 | 64.41 | 95.95 | 43.94 | 89.57 | 43.93

When LUBE is used with only Nw = 10 training points, online results are
suboptimal compared to batch approximation. This is because
LUBE's accuracy suffers when only sparse data are available (as demonstrated
in Fig. 3 and Table 1). This issue can be alleviated by increasing the number of
training points (Nw = 100), in which case online estimation with LUBE
provides PIs very similar to those of batch estimation, and much more efficiently.
Alternatively, BLM provides very similar PIs through online and batch
estimation without increasing the number of training points, as it is designed to
provide reliable bounds even when trained on sparse data.

It is worth noting that the proposed learning scheme can easily be adjusted 
to accommodate the needs of the specific application. For example, the relative 
contribution of the current sliding window could be increased in cases where 
recent measurements are considered more critical than past measurements.


4 Conclusions


Combining Bootstrap with the LUBE method enables BLM to present improved
characteristics in terms of coverage, compared to both Bootstrap and LUBE 
approaches. In particular, BLM can provide high-coverage PIs even when lim- 
ited data are available, clearly outperforming the LUBE approach. The results 
highlight the fact that even commonly used methods such as Bootstrap might 
provide unreliable PIs when the bounds are based on limited or sparse data, an 
issue that should be carefully considered by ANN practitioners. Finally, extend- 
ing BLM to online approximation constitutes a significant improvement, as it 
enables the efficient and reliable construction of PIs even when approximating 
dynamically changing processes.


Abstract. The estimation of microphysical parameters of pollution (effective
radius and complex refractive index) from optical aerosol parameters entails a 
complex problem. In previous work based on machine learning techniques, 
Artificial Neural Networks have been used to solve this problem. In this paper, 
the use of a classification and regression solution based on the k-Nearest 
Neighbor algorithm is proposed. Results show that this contribution achieves 
better results in terms of accuracy than the previous work. 


Keywords: LIDAR - Particle extinction coefficient - Particle backscatter -
Effective radius - Complex refractive index - K-Nearest Neighbor


1 Introduction 


One of the most important factors driving climate change is particulate pollution [1].
To understand the atmospheric temperature changes that cause climate change, it is
necessary to study and characterize the optical, chemical and microphysical properties
of these particles in the atmosphere. Technologies such as radiometers or Light
Detection and Ranging (LIDAR) make the observation of aerosols possible. LIDAR
has become a key tool for the characterization of atmospheric pollution. It is the only
remote sensing technique used in research on atmospheric pollution that allows for
vertically-resolved observations of particulate pollution (see, for example, [2]).
Using LIDAR, optical aerosol parameters can be extracted [3], but more
information about particles is required to understand the impact of pollution on climate
change.

Microphysical particle parameters are also of key interest to determine pollution
effects. In [4-8], inversion algorithms are used to estimate microphysical information
(particle size or complex refractive index) from optical data. Their estimation is a very
complex task because many factors, such as ambient atmospheric humidity, the
condensation of gases on existing particles or the mixing of particles of different
chemical properties, modify the values of the optical data [9]. Due to these difficulties,
inversion algorithms are very complex and require an extensive mathematical
background, as we are dealing with ill-posed inverse problems [10].

Therefore, less complex solutions must be proposed, as we need techniques that
(a) allow for fast data processing in view of current and upcoming LIDAR space
missions; (b) offer autonomous data retrieval in view of the serious lack of experts in this
research field; and (c) provide us with ways of exploiting the information content of
these highly complex data sets in an optimum way. Using synthetic optical data,
the authors of [9] developed a computational model using Artificial Neural Networks
(ANNs) [11] to estimate the effective radius of particles (r_eff) and the complex refractive
index from combinations of extinction (α) and backscatter (β) coefficients. Specifically,
these authors use values of α and β at different wavelengths (λ = 355, 532 and
1064 nm). These wavelengths are currently used by most of the LIDAR systems in the
world for the investigation of particulate pollution in the atmosphere. Most notably, and
in view of advantages not further detailed in this contribution, there has been a push for
emitting all three wavelengths simultaneously in the past 20 years. Five combinations
of α and β were tested, and the most suitable one was found to use the values
of α at λ = 355 and 532 nm (α355 and α532) and the values of β at λ = 355, 532 and
1064 nm (β355, β532 and β1064). For technical reasons, the measurement of α at
1064 nm has become possible only recently. The quality of these data, however, still
needs to be improved before tests with this extended set of β + α data can be carried
out. Moreover, ANNs were evaluated for three different size ranges of effective radii,
that is, particles with r_eff between 10-100 nm, 110-250 nm and 260-500 nm,
respectively. This separation was performed by hand for two reasons: (a) to limit the
computation time and (b) to separate particles according to their nature. Without going
into further details, particles in these three size ranges have different effects on
climate change and human health.

The aim of this work is to investigate whether we can develop a computational
method based on ML techniques that first classifies particles into the three
categories and then estimates particle properties within each category. In addition, we
look for a model with a lower computational cost than the one proposed in [9].


2 Data Description 


The dataset is the synthetic one used in [9], which was generated using a Mie scattering
algorithm [12]. It contains 1,665,343 particles. According to the three ranges of r_eff,
there are 330,480 particles with a radius between 10 nm and 100 nm, 503,155 particles
with a radius between 110 nm and 250 nm and 831,708 particles with a radius between
260 nm and 500 nm.

The following information for each particle size distribution can be found: 


- Extinction and backscatter coefficients at different wavelengths (355, 532 and
1064 nm). As in [9], the best combination of α and β will be used (α355, α532, β355,
β532 and β1064).

- Mode width, from 1.4 to 2.5 in steps of 0.1. The mode width is the geometric
standard deviation of the theoretical model that approximately describes
the shape (number concentration versus particle size) of naturally occurring
atmospheric size distributions. This shape, referred to as logarithmic-normal, can be
described as a Gaussian distribution if particle radius is plotted on a logarithmic scale.
More details can be found in, e.g., [13]. The total number of particles in the
atmosphere can be modeled by a sum of sets (distributions) according to the radius.
Particle size distributions in the atmosphere can be described by 5-6 different
modes. Each mode has its own mean radius (alternatively, we can use the mode
radius, which is the value where the size distribution reaches its maximum)
and its own mode width. In the present case we simplified our simulations in the sense
that we did not use combinations of these modes to cover the vast size
range of particles from a few nanometers to several tens of micrometers. In this first
set of studies we mimicked these naturally occurring multimodal size distributions
by single-mode logarithmic-normal (log-normal) distributions, which not
only cover the relevant particle radius range but also result in sufficiently realistic
optical properties. Furthermore, effective radius is a commonly used quantity in
climate modeling. It reduces the complexity of size distribution information from
mean radius and mode width (in each mode) to a single number. Alternatively,
effective radius can also be used for each individual mode. Optical properties of
particle size distributions described in terms of effective radius are sufficiently close
to optical properties of the underlying size distribution when used in modeling.

- Mean radius (nm), from 10 nm to 500 nm in steps of 10 nm.

- Real (from 1.2 to 2 in step size of 0.025) and imaginary part (from 0 to -0.1 in step
size of 9.99-10 ©) of the complex refractive index. This index indicates the
attenuation suffered by light when passing through a particle.
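
The parameter grid described in the list above, and the way a single effective radius
summarizes a log-normal mode, can be sketched as follows. This is a hedged illustration
rather than the generation code of [9]: it assumes that the listed "mean radius" plays the
role of the median radius of the number size distribution, in which case the standard
log-normal moment relation gives r_eff = r_median · exp(2.5 · ln²(σ)), with σ the mode
width. The function names `effective_radius` and `size_class` are hypothetical, and the
imaginary-part grid is omitted.

```python
import math


def effective_radius(median_radius_nm, mode_width):
    """Ratio of the 3rd to the 2nd moment of a log-normal number size distribution."""
    return median_radius_nm * math.exp(2.5 * math.log(mode_width) ** 2)


def size_class(r_eff_nm):
    """Map an effective radius to one of the three size ranges used in the text."""
    if 10 <= r_eff_nm <= 100:
        return "10-100 nm"
    if 110 <= r_eff_nm <= 250:
        return "110-250 nm"
    if 260 <= r_eff_nm <= 500:
        return "260-500 nm"
    return None  # outside the ranges considered in the text


# Grids as listed in the data description.
mode_widths = [round(1.4 + 0.1 * i, 1) for i in range(12)]    # 1.4 ... 2.5
mean_radii = list(range(10, 501, 10))                          # 10 ... 500 nm
real_parts = [round(1.2 + 0.025 * i, 3) for i in range(33)]   # 1.2 ... 2.0

for sigma in (1.4, 2.0, 2.5):
    r_eff = effective_radius(100, sigma)
    print(f"mode width {sigma}: r_eff = {r_eff:.0f} nm for a median radius of 100 nm")
```

Under this assumption the effective radius grows quickly with the mode width, so the same
mean radius can fall into different size classes depending on the mode width.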


Figure 1 shows how α and β vary with respect to λ for a mode width equal to 1.4.
Looking at this figure, the reader can note that α has quite similar values at the different
wavelengths, while β decreases as the wavelength increases. Similar
behavior is observed for the rest of the mode widths (from 1.5 to 2.5). It can be seen that
the backscatter coefficient increases as the effective radius increases. Those variations
are larger for higher mode widths. Similar observations can be found in the variations
of α.

Furthermore, the variation of α and β with particle effective radius for different
mode widths is also investigated. Figures 2 and 3 show the variation of β for two
different mode widths, namely 2.0 and 2.5. It can be seen that the backscatter
coefficient increases as the effective radius increases. These variations are larger when the
value of the mode width is higher. Similar observations can also be found in the
variations of α.

To determine which ML techniques can be applied to the classification and
estimation stages, Principal Component Analysis (PCA) was applied first. Figure 4
shows a PCA plot of the original synthetic data. It can be seen that the class of
110-250 nm overlaps with the class of 10-100 nm and the class of 260-500 nm, but the
class of 10-100 nm and the class of 260-500 nm do not overlap with each other.

Fig. 1. Variation of α and β in mode width 1.4 at the different values of λ.

Fig. 2. Variation of β with respect to effective radius in mode width equal to 2.0.


Fig. 3. Variation of β with respect to effective radius in mode width equal to 2.5.


According to Fig. 4, k-NN [14] can be considered as a solution because the three
classes do not overlap strongly. In addition, Extreme Learning Machine
(ELM) [15] has been chosen as a potential solution since it
uses an ANN architecture similar to the one used in [9] and can have a lower
computational cost.


3 Retrieval of Microphysical Parameters 


The estimation of microphysical parameters using ANNs was addressed in [9]. Due to 
the computational cost and the nature of particles, this estimation was performed in 
three different ranges of particle sizes separately. Taking this into account, we propose 
two solutions in this paper. The first one is a single regression solution, which uses an 
ML technique to estimate microphysical parameters for all the particles together. This 
technique must outperform ANNs in terms of accuracy. The second one includes two 
steps: (1) a classification that will separate particles into the three classes and then (2) a 
regression that will estimate microphysical parameters. 


3.1 Single Regression Solution 


In [9], a Multi-Layer Perceptron (MLP) with one hidden layer was used to
estimate microphysical parameters. Specifically, different configurations of MLPs
(training algorithms, number of neurons in the hidden layer, activation functions) were
tested. The authors concluded that the most suitable MLP contains five neurons in the
hidden layer and uses the Levenberg-Marquardt training algorithm. We have therefore
implemented the same MLP in order to compare it with ELM and k-NN in terms of both
accuracy and computational cost.


3.2 Combined Solution


Figure 5 shows a flow diagram of this combined solution. It can be seen that
five-feature vectors are first classified into three classes according to r_eff. Then a single
regression model is trained within each class. In both the classification and
estimation stages, the suitability of ELM and k-NN will be evaluated. The
detailed solution is as follows:


1. The whole dataset is split into training (75%) and test (25%) sets.
2. A classification model is trained using the training set.
3. The test set is split into the three classes.
4. Within each class, a regression model is trained.
5. Finally, the regression models from step 4 are used to estimate microphysical
   parameters on the test set.
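
The five steps can be sketched end to end with a small, self-contained k-NN implementation.
This is a toy illustration under stated assumptions, not the code used in the experiments: the
two synthetic features, the noise level, and the helper names (`nearest`, `knn_classify`,
`knn_regress`, `size_class`) are all hypothetical, and each per-class regressor is assumed to
be built from the training samples of that class.

```python
import math
import random


def nearest(train_x, query, k):
    """Indices of the k training vectors closest to `query` (Euclidean distance)."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return order[:k]


def knn_classify(train_x, train_c, query, k=3):
    votes = [train_c[i] for i in nearest(train_x, query, k)]
    return max(sorted(set(votes)), key=votes.count)  # majority vote


def knn_regress(train_x, train_y, query, k=1):
    idx = nearest(train_x, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


def size_class(r):
    return 0 if r <= 100 else 1 if r <= 250 else 2


# Synthetic stand-in data: two made-up features per sample and a radius target.
rng = random.Random(0)
radii = [r for r in range(10, 501, 10) for _ in range(4)]
samples = [((r / 500 + rng.gauss(0, 0.005), (r / 500) ** 2), r) for r in radii]

# Step 1: 75% / 25% split.
rng.shuffle(samples)
cut = int(0.75 * len(samples))
train, test = samples[:cut], samples[cut:]
train_x = [x for x, _ in train]
train_r = [r for _, r in train]
train_c = [size_class(r) for r in train_r]

# Step 2: k-NN has no explicit fitting step; the labelled training set is the model.
# Step 3: assign each test sample to one of the three classes.
test_c = [knn_classify(train_x, train_c, x, k=3) for x, _ in test]

# Step 4: one regression model per class, built from the training samples of that class.
per_class = {
    c: ([x for x, cc in zip(train_x, train_c) if cc == c],
        [r for r, cc in zip(train_r, train_c) if cc == c])
    for c in range(3)
}

# Step 5: estimate the target with the regressor of the predicted class.
pred = [knn_regress(*per_class[c], x, k=1) for (x, _), c in zip(test, test_c)]
rmse = math.sqrt(sum((p - r) ** 2 for p, (_, r) in zip(pred, test)) / len(test))
accuracy = sum(c == size_class(r) for c, (_, r) in zip(test_c, test)) / len(test)
print(f"classification accuracy: {accuracy:.2f}, RMSE of radius: {rmse:.1f} nm")
```

A misclassified test sample is passed to the regressor of the wrong class, so errors at the
classification stage propagate to the estimates.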



Fig. 5. Scheme of the full solution combining classification and regression estimation.


4 Experiments


In this section the results of both solutions are provided. To evaluate the classification
solution, precision, recall and F1-score [16] are calculated for each class, since the three
classes are imbalanced. To test the suitability of each microphysical parameter
estimation solution, the Root Mean Square Error (RMSE) is used. The
computational cost of each technique is evaluated in terms of CPU time.
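
For reference, the evaluation measures named above can be written out as a short sketch
using their standard definitions. The function names and the toy labels are illustrative
assumptions and do not come from the experiments.

```python
import math


def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1-score for one class, treating it as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def rmse(y_true, y_pred):
    """Root Mean Square Error between true and estimated values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy labels for the three size classes (made up for illustration).
true_classes = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
pred_classes = [1, 1, 2, 2, 2, 3, 3, 3, 3, 2]

for label in (1, 2, 3):
    p, r, f1 = per_class_scores(true_classes, pred_classes, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")

print(f"RMSE = {rmse([100, 200, 300], [110, 190, 300]):.2f}")
```

Reporting the scores per class rather than a single accuracy value keeps the smallest class
from being hidden by the largest one.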


4.1 Single Regression Solution 


In Table 1, ELM and k-NN are compared with the MLP configuration in the previous 
work, when all the particles are analyzed together. For the MLP, the whole dataset has 
been split into training (65%), validation (15%) (where the validation set is used to 
control overfitting) and test (25%) sets while for k-NN and ELM, the dataset has been 
split into training (75%) and test (25%). Note that the results for the different classes are 
also extracted from the total results and shown in the table. 

Due to space limitations, only the results for the best configuration of each method are
presented in the table. For instance, the methodology based on k-NN has been tested
for different numbers of neighbours (1, 3, 4, 5, 7, 11, 15, 19, 29 and 39), with
k = 1 being the best choice. In the case of ELM, different activation functions
(sigmoid, sine or hard limit) and numbers of neurons in the hidden layer (N = 2, 3, 4, 5, 7,
10, 20, 30, 50, 75, 100, 150, 200 and 300) have been tested. The sigmoid function and
N = 300 achieve the best results.

Looking at Table 1, it is clear that k-NN produces the best results (the lowest values
of RMSE) for all the parameters and across all classes of particles.
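
The search over the number of neighbours can be sketched as a simple hold-out comparison.
The data below are a made-up, noise-free one-dimensional function, used only to show the
selection loop; the helper names are hypothetical and the outcome on this toy problem says
nothing about the values reported in Table 1.

```python
import math
import random


def knn_predict(train, query, k):
    """Average target of the k training points nearest to `query` (1-D feature)."""
    neighbours = sorted(train, key=lambda s: abs(s[0] - query))[:k]
    return sum(y for _, y in neighbours) / k


def holdout_rmse(train, test, k):
    errors = [(knn_predict(train, x, k) - y) ** 2 for x, y in test]
    return math.sqrt(sum(errors) / len(errors))


rng = random.Random(1)
data = [(x, math.sin(x)) for x in (rng.uniform(0, 6) for _ in range(400))]
train, test = data[:300], data[300:]

candidates = [1, 3, 4, 5, 7, 11, 15, 19, 29, 39]  # the values listed in the text
scores = {k: holdout_rmse(train, test, k) for k in candidates}
best_k = min(scores, key=scores.get)

for k in candidates:
    print(f"k = {k:2d}: RMSE = {scores[k]:.4f}")
print("best k:", best_k)
```

On densely sampled, noise-free synthetic data, averaging over many neighbours mainly adds
smoothing error, which is one plausible reason why a very small k can perform best.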


Table 1. RMSE obtained by the MLP [9], ELM and k-NN based solutions when r_eff and the
real and imaginary parts of the complex refractive index are estimated.


Param.     | Method        | Whole data | 10-100 nm | 110-250 nm | 260-500 nm
r_eff      | MLP [9]       | 43.40      | 14.51     | 37.97      | 53.03
           | k-NN (k = 1)  | 31.18      | 6.17      | 27.80      | 38.24
           | ELM (N = 300) | 44.37      | 15.29     | 39.82      | 53.74
Real part  | MLP [9]       | 0.15       | 0.19      | 0.14       | 0.14
           | k-NN (k = 1)  | 0.08       | 0.13      | 0.07       | 0.07
           | ELM (N = 300) | 0.19       | 0.23      | 0.17       | 0.18
Imag. part | MLP [9]       | 0.18       | 0.25      | 0.16       | 0.15
           | k-NN (k = 1)  | 0.08       | 0.10      | 0.09       | 0.06
           | ELM (N = 300) | 0.26       | 0.29      | 0.27       | 0.26


Another aspect to be considered is the computational cost associated with each
solution. In some applications, the study of atmospheric pollution must be carried out in
real time, so it is important to have fast algorithms. Bearing this in mind, Table 2 shows
the mean CPU times of each solution for training and test. These experiments
have been carried out on a computer with a 2.8 GHz Intel Core i7 processor and 8 GB
RAM.


Table 2. Mean values of CPU time (s) associated with MLP [9], ELM and k-NN based 
solutions. 


         | MLP [9] | k-NN  | ELM
Training | 753.86  | 7.25  | 181.79
Test     | 0.44    | 13.00 | 5.29


It is clear that the solution based on k-NN is the least expensive in terms of CPU time
when training. In test, the MLP needs less time but it is less accurate when estimating
parameters. ELM requires more training time than k-NN (181.79 s versus 7.25 s) and
achieves the worst RMSE values. For these reasons, larger numbers of neurons have not
been studied with ELM.
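
How such CPU times can be collected is sketched below with `time.process_time` from the
Python standard library. The text does not state how the timings in Table 2 were measured, so
the helper `cpu_seconds` and the toy `fit` and `predict` functions are assumptions made purely
for illustration.

```python
import time


def cpu_seconds(func, *args, repeats=3):
    """Mean CPU time of `func(*args)` over a few repetitions, plus the last result."""
    start = time.process_time()
    for _ in range(repeats):
        result = func(*args)
    return (time.process_time() - start) / repeats, result


def fit(points):
    # For a lazy learner such as k-NN, "training" amounts to storing the data.
    return sorted(points)


def predict(model, queries):
    # Brute-force nearest-neighbour lookup for every query.
    return [min(model, key=lambda p: abs(p - q)) for q in queries]


points = [((i * 7919) % 10007) / 10007 for i in range(2000)]
queries = [i / 100 for i in range(100)]

train_time, model = cpu_seconds(fit, points)
test_time, predictions = cpu_seconds(predict, model, queries)
print(f"training: {train_time:.4f} s, test: {test_time:.4f} s")
```

The sketch reflects the asymmetry visible in Table 2: a lazy learner does little work at
training time and defers most of the cost to the test stage.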


4.2 Combined Solution 


As shown in Fig. 5, the combined solution allows us to split particles into the three
classes (10-100 nm, 110-250 nm, 260-500 nm) before estimating parameters. At the
classification stage, MLP, k-NN and ELM based classifiers have been studied, with
k-NN giving the highest accuracy (96.09%) when k = 3. Since classes are unbalanced,
precision, recall and F1-scores are also provided for each class in Table 3.


Table 3. Precision, recall and F1-scores obtained by the solution based on k-NN (k = 3) for
each particle class.


Class                | Precision (%) | Recall (%) | F1-score
Class 1 (10-100 nm)  | 98.76         | 97.69      | 0.98
Class 2 (110-250 nm) | 93.63         | 93.43      | 0.94
Class 3 (260-500 nm) | 96.52         | 97.07      | 0.97


Very good results are achieved, the worst being for class 2, which makes sense, since
it overlaps with the other two classes (Fig. 4).
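
The figures in Table 3 can be cross-checked with a few lines of arithmetic. The F1-score is
the harmonic mean of precision and recall, and, assuming the test set preserves the class
proportions of the full dataset (an assumption, since the split is not described in that
detail), the overall accuracy equals the class-size-weighted recall.

```python
# class: (precision %, recall %, reported F1-score, number of particles in the dataset)
table3 = {
    "10-100 nm": (98.76, 97.69, 0.98, 330_480),
    "110-250 nm": (93.63, 93.43, 0.94, 503_155),
    "260-500 nm": (96.52, 97.07, 0.97, 831_708),
}

for name, (p, r, f1_reported, _) in table3.items():
    f1 = 2 * p * r / (p + r) / 100  # harmonic mean, rescaled from percent to [0, 1]
    print(f"{name}: F1 = {f1:.2f} (reported {f1_reported})")

total = sum(row[3] for row in table3.values())
accuracy = sum(row[1] * row[3] for row in table3.values()) / total
print(f"size-weighted recall = {accuracy:.2f}%")
```

Under that assumption the weighted recall comes out at 96.09%, matching the accuracy
reported for k = 3, and the recomputed F1-scores agree with the table after rounding.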

Once particles in the test set have been separated using k-NN, the estimation
models must be applied. The RMSE scores obtained are presented in Table 4.

As can be seen from this table, the combined solution based on k-NN achieves the
best results for all the classes and microphysical parameters. If we compare these
results with those in Table 1, we can see that in general the combined method performs
better than or similarly to the single regression solution. Moreover, ELM performs
better when classification is done first, that is, it performs better within each class.
As for the computational cost, conclusions similar to those drawn from Table 2 can be made.


Estimation of Microphysical Parameters of Atmospheric Pollution 587 


Table 4. RMSE obtained by the MLP [9], ELM and k-NN based solutions when r_eff and the
real and imaginary parts of the complex refractive index are estimated with the combined
solution.


Param.     | Method        | 10-100 nm | 110-250 nm | 260-500 nm
r_eff      | MLP [9]       | 10.56     | 30.29      | 51.17
           | k-NN (k = 1)  | 5.86      | 27.83      | 37.82
           | ELM (N = 300) | 11.01     | 32.38      | 50.79
Real part  | MLP [9]       | 0.22      | 0.16       | 0.17
           | k-NN (k = 1)  | 0.09      | 0.07       | 0.07
           | ELM (N = 300) | 0.21      | 0.15       | 0.17
Imag. part | MLP [9]       | 0.27      | 0.27       | 0.15
           | k-NN (k = 1)  | 0.06      | 0.08       | 0.07
           | ELM (N = 300) | 0.24      | 0.25       | 0.23


5 Discussion and Conclusions 


Estimating microphysical parameters from optical data can be done using ML
techniques. Building on the work in [9], two objectives have been met in our
work. Firstly, we provide a solution that produces lower RMSE and computational cost
when estimating microphysical parameters. Secondly, a new combined solution, which
produces high accuracy at the classification stage, has been implemented.


Abstract. This paper is intended to bring added value in the interdisciplinary
domains of computer science and psychology, more precisely,
automated learning and applied psychology. We present automated learning 
techniques for classification of new instances, new observations of a patient, 
taking into account the particularities of the attributes describing each of these 
observations. Specifically, information collected by applying a questionnaire for 
communication style (non-assertive style, manipulative style, aggressive style 
and assertive style) was analyzed. 

Through these experiments, we have tried to determine which
classification models are best suited to specific situations and, given the
type of attributes that make up the instances of the dataset, which
preprocessing methods can be applied to obtain the best results using the
selected classification models: Decision Tree Based Model, Support Vector
Machine, Random Forest, Classification based on instances (k-NN), and
Logistic Regression. Standard metrics were used to evaluate the performance of
each of the analyzed classification models: accuracy, sensitivity, precision, and
specificity.


Keywords: Multi-classification model - Prediction - Communication style 


1 Introduction 


The use of techniques from the field of Artificial Intelligence is becoming
more of a necessity every day. One reason is that intelligent
techniques can bring added value and even help people in their work. Lately, the number
of areas in which these techniques are applied has increased.

Classification is an automated learning technique, more specifically a supervised learning
technique, which requires labeled training data to generate rules. It is a two-stage
process [14]. The first stage is the learning stage, in which the training set is analyzed
and new classification rules are generated. The second stage is the classification itself, in
which the testing set is split into appropriate classes based on the rules established in
the previous stage.

Within this classification process, the classes are defined based on the attribute
values of the instances in the datasets. In view of these considerations, the classification
process can be of several types: binary classification, multi-class classification,
multi-label classification and multi-task classification [3, 14].

In this article, we study multi-class classification (or
multi-classification). Multi-classification assumes that each instance belongs to only one
labeled class out of a total of n labeled classes (with n > 2).

The classification models analyzed in our experiments are: the decision tree based
model, the Support Vector Machine (SVM) based classification model, the Random
Forest classification model, classification based on instances (k-NN), and the
logistic regression classification model.

The performance of each of the analyzed classification models is quantified with
standard metrics for evaluating the quality of a model: accuracy,
precision, sensitivity, and specificity.

The analyzed dataset comes from the psychology domain and consists of 220
instances and 63 mixed attributes. It contains information on how people
communicate. The target variable of this dataset is the communication style of
each of the individuals analyzed, determined from the responses given.


2 State of the Art 


Moving beyond the initial focus of psychology on studying and explaining human
behaviors, many researchers emphasize the need to advance further into predicting
future behaviors. In this regard, machine learning can be a very accurate tool, proving
these concepts and helping answer important research questions from psychology [18].

Regarding communication styles, accurately predicting their impact in diverse
socio-economic fields is of great social importance. For example, charity fundraising
is influenced by the interaction of both helper and receiver, especially if the receiver is
considered agreeable [19].

A research study examining the feasibility of using machine learning for predicting
psychological wellbeing indices [20] obtained promising results using Support
Vector Regression (0.76), Generalized Regression Neural Network (0.80), and k
Nearest Neighbor Regression (0.70), but poor results with the Multi-layer Perceptron.
Given the impact of stress on human behavior, Pandey et al. [22] use machine learning
to predict stress based on heart rate and age using Logistic Regression, Support Vector
Machine (SVM), VF-15 and Naive Bayes, with results up to 68% for SVM.


3 Methods and Experiments 


All experiments follow these stages:

Stage 1: Data preparation;
Stage 2: Training the classification model;
Stage 3: Evaluating the classification model.


3.1 Dataset Descriptions


A. Psychological Perspective Over Dataset Characteristics 
The data analyzed in this experiment were obtained by applying the Questionnaire for
the Communication Style (QCS) analysis [16] to a group of 220 people. The group is
homogeneous in terms of age and living environment; all the persons
questioned live in an urban environment.
The questionnaire for communication style analysis consists of a set of 60 questions,
whose answers can be yes/no (or true/false) [16].

The participants in this study were informed about, and agreed to, the use of the
questionnaire results for research purposes.

Each of the 60 questions is part of a category of communication style as follows: 1, 
7, 15, 16, 17, 25, 26, 35, 36, 37, 50, 51, 52, 59, 60 for the non-assertive style; in the 
category of aggressive style we have the questions 4, 6, 10, 11, 20, 21, 28, 29, 30, 39, 
40, 48, 49, 55, 56 and in the category of manipulative style are questions 3, 5, 9, 12, 13, 
19, 22, 31, 32, 41, 42, 46, 47, 54, 57, and finally in the assertive style category we have 
the following questions: 2, 8, 14, 18, 23, 24, 27, 33, 34, 38, 43, 44, 45, 53, 58. 
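
The assignment of questions to styles can be encoded directly, which also makes it easy to
check that the four groups split the 60 questions evenly. The scoring rule in
`dominant_style` (the style with the most "yes" answers wins) is an assumption introduced
here for illustration; the text does not state how the class label is derived from the
answers.

```python
STYLE_ITEMS = {
    "non-assertive": [1, 7, 15, 16, 17, 25, 26, 35, 36, 37, 50, 51, 52, 59, 60],
    "aggressive": [4, 6, 10, 11, 20, 21, 28, 29, 30, 39, 40, 48, 49, 55, 56],
    "manipulative": [3, 5, 9, 12, 13, 19, 22, 31, 32, 41, 42, 46, 47, 54, 57],
    "assertive": [2, 8, 14, 18, 23, 24, 27, 33, 34, 38, 43, 44, 45, 53, 58],
}


def style_scores(answers):
    """Count the 'yes' answers per style; `answers` maps question number (1-60) to a bool."""
    return {style: sum(answers[q] for q in items) for style, items in STYLE_ITEMS.items()}


def dominant_style(answers):
    # Assumed rule: the style with the most 'yes' answers wins (ties: first in the dict).
    scores = style_scores(answers)
    return max(scores, key=scores.get)


# Hypothetical respondent: 'yes' to every assertive item and to questions 1 and 7.
answers = {q: q in STYLE_ITEMS["assertive"] or q in (1, 7) for q in range(1, 61)}
print(style_scores(answers))
print(dominant_style(answers))
```

Each style is covered by exactly 15 questions, and every question from 1 to 60 appears in
exactly one group.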

As a result, this questionnaire highlights four different styles of communication:
non-assertive style, manipulative style, aggressive style and assertive style.

Besides the answers that make up the questionnaire, the people involved also 
quantified their stress level, referring to three possible degrees: low, medium, high. 

This stress-related information attempts to highlight whether there is a link between 
the level of stress and the style of communication of a particular person. The applied 
questionnaire was proposed by Marcus [21]. 


B. Computer Science Perspective Over Dataset Characteristics
Taking into consideration all of the above, and translating it into a
form suitable for computer-based analysis, we have:


• A dataset that contains 220 instances (observations) with the following
characteristics: a discrete attribute of nominal type (gender of the person), a
numerical attribute (age), a discrete attribute of ordinal type (the level of stress) and
60 binary attributes (true/false) corresponding to the questionnaire.

• Each of the 220 instances of the dataset belongs to one of the four possible classes
(non-assertive style class, manipulative style class, aggressive style class, and
assertive style class).

• Consequently, we have a mixed dataset divided into four classes.


With all this information, our goal in this research is first and foremost to determine
whether there is a link between the stress level attribute and the style of
communication of a particular person, and secondly to classify a new data instance
as accurately as possible. The four classes corresponding to the four styles of
communication (non-assertive style, manipulative style, aggressive style and assertive
style) lead to the idea of multi-classification.


4 Classification Methods Used


The purpose is to determine a classification model that finds the relationship between
the attribute associated with the class, in our case the communication style, and the
other existing attributes. The constructed model must then be able to determine
the class to which a new instance belongs.

The models used in our experiments are:


• Decision tree-based classification model (DT) [5]

• Support Vector Machine based classification model (SVM) [9, 10]

• Random forest classification model (RF) [12]

• The instance-based classification model (kNN) [4, 8]

• The Bayesian-based classification model (NB) [7, 13]

• The logistic regression classification model (mLR), which reproduces the relationship
between a set of independent variables x_i (categorical, continuous) and a dependent
variable Y (nominal, binary).


The multinomial logistic regression model (also known as the discrete choice
model) is a generalization of the logistic model that
allows the dependent variable Y to have more than two values. Assume that the
variable Y takes its values in the unordered set {1, ..., g}. The
multinomial logistic model assumes that the probability that Y equals s in
observation i depends on the values of the variables x_i1, ..., x_ip through [6, 11]:

$$P(Y_i = s) = \frac{e^{\eta_{is}}}{\sum_{t=1}^{g} e^{\eta_{it}}} \qquad (1)$$
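
Equation (1) can be evaluated with a few lines of code. The sketch below assumes the usual
linear predictor η_is = b_s0 + b_s1·x_i1 + ... + b_sp·x_ip, which the text does not spell
out; the coefficient layout and the function name `class_probabilities` are hypothetical.

```python
import math


def class_probabilities(x, coefficients):
    """P(Y = s | x) for every class s under a multinomial logistic model.

    `coefficients[s]` is a hypothetical list [b_s0, b_s1, ..., b_sp]; the linear
    predictor eta_s = b_s0 + sum_j b_sj * x_j is an assumption, not taken from the text.
    """
    etas = [b[0] + sum(bj * xj for bj, xj in zip(b[1:], x)) for b in coefficients]
    shift = max(etas)  # subtracting the maximum avoids overflow in exp
    weights = [math.exp(eta - shift) for eta in etas]
    total = sum(weights)
    return [w / total for w in weights]


# Four classes (g = 4) and two explanatory variables (p = 2), with made-up coefficients.
coefficients = [
    [0.0, 0.0, 0.0],
    [0.5, 1.0, -1.0],
    [-0.5, 0.2, 0.3],
    [0.1, -0.4, 0.8],
]
probabilities = class_probabilities([1.0, 0.0], coefficients)
print([round(p, 3) for p in probabilities])
```

The probabilities are positive and sum to one by construction, and the predicted class is
the one with the largest probability.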


5 Data Preprocessing 


The dataset analyzed in this article contains mixed types of data; therefore, we
applied some preprocessing operations, detailed below.
First, for the age attribute, we applied standardization
to eliminate the influence of its different scale relative to the other attributes.
Second, it was important to consider how to calculate distances (for
dissimilarity/similarity measures), given that we had different types of attributes,
as follows:


• For nominal attributes, the distance between two values of an attribute is
0 if the values are equal and 1 if they are different. There are no other relationships
between attribute values and there are no intermediate distances between 0 and 1.

• For ordinal attributes, we may consider equally spaced values between 0 and 1.
For example, in the case of the three stress levels {Low, Medium, High}, we used
the transformation: Low = 0, Medium = 0.5, and High = 1. The
distance is then the absolute value of the difference between these
numerical values, so that d(Low, High) = 1 and d(Low, Medium) = d(Medium,
High) = 0.5.

• For numerical attributes, the distance is the absolute value of the difference
between the normalized values of the attributes.
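
The three per-attribute distances can be combined into a single dissimilarity between two
instances. The text does not say how the per-attribute distances are aggregated, so the
plain average used below (in the spirit of Gower's coefficient) is an assumption, as are
the record layout and the function names.

```python
STRESS = {"Low": 0.0, "Medium": 0.5, "High": 1.0}


def nominal_distance(a, b):
    return 0.0 if a == b else 1.0


def ordinal_distance(a, b):
    return abs(STRESS[a] - STRESS[b])


def numeric_distance(a, b):
    return abs(a - b)  # expects values already normalized to a common scale


def mixed_distance(p, q):
    """Average per-attribute distance between two records.

    Hypothetical record layout: (gender, normalized age, stress level, 60 booleans).
    """
    parts = [
        nominal_distance(p[0], q[0]),
        numeric_distance(p[1], q[1]),
        ordinal_distance(p[2], q[2]),
    ]
    parts += [nominal_distance(a, b) for a, b in zip(p[3], q[3])]
    return sum(parts) / len(parts)


person_a = ("F", 0.2, "Low", (True,) * 60)
person_b = ("M", 0.7, "High", (True,) * 30 + (False,) * 30)
print(f"d(a, b) = {mixed_distance(person_a, person_b):.3f}")
```

With this averaging every one of the 63 attributes carries the same weight, which is
consistent with the statement that no attribute prioritization was applied.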


Missing attribute values were replaced with the average of the existing values of
that attribute. No attribute prioritization methods were applied in these
experiments.

Another preprocessing step, the method of dividing the data into subsets, was done 
taking into account cross-validation, a method which involves the division of the data 
k times (k-1 sub-sets for training and / sub-set for validation). In this context, the size 
of a sub-set is equal to the size of the set divided by k and the performance is given by 
the average of the k executions [15]. 
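A minimal sketch of this splitting scheme is given below. The function names, the
random shuffling, and the fixed seed are assumptions; `score_fn` stands for any
hypothetical routine that trains on the training indices and returns a performance
value on the validation indices.

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Yield (train, validation) index arrays for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)  # k sub-sets of size about n / k
    for i in range(k):
        validation = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, validation

def cross_validate(score_fn, n, k):
    """Average the score of the k executions."""
    return float(np.mean([score_fn(tr, va) for tr, va in k_fold_indices(n, k)]))

splits = list(k_fold_indices(10, 5))
```

Each instance appears in exactly one validation sub-set, so every instance is used
for validation once and for training k-1 times.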


6 Evaluation of the Classification Models 


The quality of a classifier from the perspective of the correct identification of a class is 
measured using information in the confusion matrix containing [1, 4]: 


• The number of data correctly classified as belonging to the class of interest: True
positive cases (TP)

• The number of data correctly classified as not belonging to the class of interest:
True negative cases (TN)

• The number of data incorrectly classified as belonging to the class of interest: False
positive cases (FP)

• The number of data incorrectly classified as not belonging to the class of interest:
False negative cases (FN)


6.1 Metrics Used for Evaluation 


The metrics used in this article to evaluate the quality of the classifiers are
described below.

Classification accuracy is determined by the ratio of the number of correctly
classified instances to the total number of classified instances [2].

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$


The sensitivity metric is given by the ratio between the number of data correctly
classified as belonging to the class of interest and the sum of the number of data
correctly classified as belonging to the class of interest and the number of data
incorrectly classified as not belonging to the class of interest.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3)$$


The metric of specificity is given by the ratio of the number of data correctly
classified as not belonging to the class of interest and the sum of the number of data
correctly classified as not belonging to the class of interest and the number of data
incorrectly classified as belonging to the class of interest.

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (4)$$

The precision metric is given by the ratio between the number of data correctly
classified as belonging to the class of interest and the sum of the number of data
correctly classified as belonging to the class of interest and the number of data
incorrectly classified as belonging to the class of interest [17]:

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (5)$$
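The four metrics of Eqs. (2) to (5) can be computed directly from the entries of the
confusion matrix, as in the sketch below. The function name and the counts in the
example are hypothetical and do not correspond to the experiments reported here.

```python
def evaluation_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Hypothetical counts for a class of interest.
m = evaluation_metrics(tp=40, tn=45, fp=5, fn=10)
```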


6.2 Results and Discussion 


The values of the accuracy metric for each of the classification models analyzed in our
experiments are shown in Fig. 1.


[Figure 1: bar chart titled "Classifiers Accuracy Value"; vertical axis from 0.80 to 0.94; recovered classifier labels: Naive Bayes, Logistic Regression, Decision Tree, Random Forest]


Fig. 1. Accuracy values for the analysed models 


Both the Random Forest and Support Vector based algorithms are non-parametric 
models (their complexity increases as the number of instances used for training 
increases) [15]. That being said, the training of a non-parametric model is costlier, from 
a computational perspective, compared to a generalized linear model (k-NN, Naive- 
Bayes, in our case). However, the added benefit of the two non-parametric models used 
is the fact that they allow working with several classes immediately. 

A disadvantage of the Support Vector based algorithm is that its results are
sometimes more difficult to interpret, but in our case the evaluation metric is the
accuracy, which helps us overcome this shortcoming.

In contrast, Decision Trees allow a much simpler and faster interpretation of the
results. They are also among the classification models that behave best with ordinal
and/or mixed attributes, as in the dataset analyzed in our experiments.

Logistic Regression (multinomial logistic regression in our case) is another
well-known and widely used classification model. To better frame the results, we also
used it, but the accuracy obtained, 0.88, was not as good as that of the other
classification models. One of the reasons for this result is that the model requires
the classes to be linearly separable in the attribute space, which was not true in our
case. This shortcoming is, however, overcome by the Support Vector based
classification, which yielded an accuracy of 0.95.

The values obtained for each of the analyzed classification models fall
within the standard ranges accepted in the research literature. They also
confirm what our accuracy metric indicates, namely that for mixed data of this
size, the best classification models are Random Forest and Support Vector Machine.

The values of the other metrics are given in Table 1.


Table 1. Results achieved

Metric      | Naive Bayes | k-NN | Logistic regression | Decision tree | SVM  | Random forest
Sensitivity | 0.75        | 0.73 | 0.77                | 0.78          | 0.74 | 0.77
Specificity | 0.68        | 0.72 | 0.62                | 0.63          | 0.62 | 0.63
Precision   | 0.81        | 0.79 | 0.81                | 0.78          | 0.82 | 0.81


7 Conclusions and Future Directions 


In this paper, we approached an interdisciplinary topic between psychology and
computer science, more particularly applied cognitive psychology and machine learning.

We investigated the behavior of 6 classification models on a data set
gathered using the QCS. We applied these models to a training data set and then
used the results achieved on the testing data set to select the best of the
models. Once the model with the best accuracy value has been established, it can be
applied to new, as yet unclassified, instances of data.

This work provides a baseline for future work in this domain. We treated here the
classification task, an approach useful to the psychology area and, as far as we
know, an innovative one in the analysis of communication style from a
machine learning perspective.

As new development directions, we propose to apply other classification methods
that increase the value of the accuracy metric and are more robust with respect to the
data type (regardless of the preprocessing methods applied) and, obviously, to increase
the sample size of people questioned for a better generalization of the results.
Abstract. In this work we address the problem of learning from images 
to perform grouping and classification of shapes. The key idea is to 
encode the instances available for learning in the form of directional 
data. In two dimensions, the figure to be categorized is characterized 
by the distribution of the directions of the normal unit vectors along the 
contour of the object. This directional characterization is used to extract 
characteristics based on metrics defined in the space of circular distribu- 
tions. These characteristics can then be used to categorize the encoded 
shapes. The usefulness of the representation proposed is illustrated in 
the problem of clustering and classification of otolith shapes. 


Keywords: Directional data · Shape representation · Shape clustering · Shape classification


1 Introduction 


Automatic induction from complex data that are characterized by functions, 
graphs, distributions or shapes is one of the important open problems in Machine 
Learning. In this work we address the task of grouping or classifying objects 
according to their shapes. A system that automatically discriminates among 
shapes is useful in numerous domains of application, such as archeology, pale- 
ontology, biology, geology, or medicine [10,12,15]. Shape can be defined as an 
equivalence class that is invariant under a family of transformations, such as 
translations, scaling, rotations, and small deformations [8]. 

One of the difficulties of this task is to provide an appropriate characteriza- 
tion of shape that is tractable yet preserves sufficient amounts of information to 
allow grouping and discrimination. Previous approaches to this problem are rep- 
resentations based on landmarks [11], medial representations [16,20], and others 
such as probability density functions [14]. In this work we will adopt a functional 
approach [17] and characterize the shape by the distribution of the normal vec- 
tors to the curve (in 2 dimensions) or surface (in 3 dimensions) that delimits 
the figure [7]. Alternative functional encodings have been considered in recent
investigations [1,13]. These representations are based on encoding shape by the
distribution of distances between points within [1] or at the boundary of the object
[13]. The focus of this work is on two-dimensional (planar) representations of
objects. Nevertheless, the method proposed can be readily extended to higher
dimensions using the tools of directional statistics [11]. To encode the distribution
of the directions of the normal vectors in two dimensions it is sufficient to
store the angles of such vectors at each location on the boundary. These values
can be viewed as samples from a random variable defined in $[-\pi, \pi]$. This random
variable is characterized by a probability distribution function defined on
the circle. Examples of such probability density estimates are depicted in Fig. 1.
To build empirical estimates of these distributions we consider a set of N
points sampled at regular intervals along the contour of the figure. At each of
these points we compute the direction of the normal to the contour and store the
corresponding angles $\{\theta_n\}_{n=1}^{N}$. An empirical estimate of the probability density
is given by the histogram of the data using $N_{bins}$ equally spaced bins in $[-\pi, \pi]$.
The histogram is scaled so that the area under it is one. Alternatively, a kernel
estimator is used to provide a smooth approximation of the density


$$\hat{f}_{KDE}(\theta; \nu) = \frac{1}{N} \sum_{n=1}^{N} K(\theta - \theta_n; \nu), \qquad (1)$$

where $K(\theta; \nu)$ is a periodic normalized kernel (i.e. its integral in $\theta \in [-\pi, \pi]$ is 1),
whose characteristic width is $h = 1/\nu$. In this work, the von Mises kernel is used

$$K(\theta; \nu) = \frac{1}{2\pi I_0(\nu)} e^{\nu \cos(\theta)}, \qquad (2)$$

where $I_0$ is the modified Bessel function of the first kind of order 0. For higher
dimensions the von Mises-Fisher distribution can be used [4]. The quality of
the kernel density estimate depends strongly on the value of this parameter
[2,3,5,18]: On the one hand, if the kernel is too narrow, the density estimate
will lack stability and exhibit large variance. If, on the other hand, the width is
too large, relevant features of the probability density will be smoothed out. In the
experimental evaluation performed, h = a provides a good compromise between
stability and smoothness. Correspondingly, for the histogram density estimate,


(a) Square (b) Circular sector 


Fig. 1. Empirical probability density for the directional variables in simple figures. 
Fig. 2. Characterization of planar geometrical figures (top row) using a scaled
histogram (middle row) and kernel density (bottom row) empirical estimates of the
probability density of the direction of the normal vectors along the contour of the figure.


$N_{bins} = 64$ bins have been used. Figure 2 displays examples of these estimates
for simple geometrical figures. Once the object representations have been
characterized by the corresponding probability density estimates, their shapes can be
compared using different metrics. To define these metrics, the discretized versions
of the probability densities $f(\theta)$ and $g(\theta)$ on the circle ($\theta \in [-\pi, \pi]$), at the
sampling points $\{\theta_n\}_{n=1}^{N}$, are considered; namely, $f = \{f_n\}_{n=1}^{N}$ and $g = \{g_n\}_{n=1}^{N}$
with $f_n = f(\theta_n)$ and $g_n = g(\theta_n)$, respectively. In terms of these discretized
versions of the densities, the metrics in Table 1 have been considered. Let F and G
be two figures, characterized by f and g, respectively. The relative orientations
of F and G could be different. Therefore, a distance between these figures can
be defined as the minimum value of the metric between a rotation of the first
density and the second density


$$D(F, G) = \min_{n = 1, 2, \ldots, N} D(f^{[n]}, g), \qquad (3)$$

where D is one of the metrics considered and
$f^{[n]} = \{f_n, f_{n+1}, \ldots, f_N, f_1, \ldots, f_{n-1}\}$ is a rotation of the density f.
The distance function given by
Eq. (3) requires the evaluation of the specified metric for all N sampling points,
which is a costly computation. A more effective way to account for rotations is
to measure distances with respect to C, the uniform circular distribution in two
dimensions (or the uniform distribution on a sphere in 3 dimensions):

$$D(F, G) = |D(F, C) - D(G, C)|. \qquad (4)$$
As will be illustrated in the section on empirical evaluation, this measure retains 
sufficient information to provide a reasonably good characterization of shape at 
a reduced computational cost. 
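The difference in cost between Eq. (3) and Eq. (4) can be seen in the following
sketch, which uses the Manhattan distance as the underlying metric. The function
names and the example densities are hypothetical; the uniform circular density is
assumed to be discretized as the constant 1/(2π) at every sampling point.

```python
import numpy as np

def manhattan(f, g):
    """L1 distance between two discretized densities."""
    return float(np.abs(f - g).sum())

def rotation_min_distance(f, g, metric=manhattan):
    """Eq. (3): minimum of the metric over the N cyclic rotations of f."""
    return min(metric(np.roll(f, -n), g) for n in range(len(f)))

def reference_distance(f, g, metric=manhattan):
    """Eq. (4): compare the distances of f and g to the uniform circular density."""
    c = np.full(len(f), 1.0 / (2.0 * np.pi))
    return abs(metric(f, c) - metric(g, c))

# Hypothetical discretized density and a rotated copy of it.
f = np.array([0.60, 0.30, 0.05, 0.02, 0.01, 0.01, 0.10, 0.18])
g = np.roll(f, 3)
d3 = rotation_min_distance(f, g)
d4 = reference_distance(f, g)
```

Eq. (3) needs N evaluations of the metric per pair of figures, whereas Eq. (4) needs
one evaluation per figure, which can be computed once and stored.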


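Table 1. Distance functions between the discretized distributions $f = \{f_n\}_{n=1}^{N}$ and
$g = \{g_n\}_{n=1}^{N}$.

Distance        | Expression
Manhattan       | $L_1(f, g) = \sum_{n=1}^{N} |f_n - g_n|$
Euclidean       | $L_2(f, g) = \sqrt{\sum_{n=1}^{N} (f_n - g_n)^2}$
Total variation | $D_\infty(f, g) = \max_{n=1,2,\ldots,N} |f_n - g_n|$
$\chi^2$        | $\chi^2(f, g) = \sum_{n=1}^{N} \frac{(f_n - g_n)^2}{f_n + g_n}$
Hellinger       | $H(f, g) = \frac{1}{\sqrt{2}} \sqrt{\sum_{n=1}^{N} (\sqrt{f_n} - \sqrt{g_n})^2}$
Earth Mover's   | $EMD(f, g) = \inf E[|\theta - \theta'|]$,
                | where the infimum is taken over all possible joint distributions of $\theta$
                | and $\theta'$, random variables whose marginals are f and g, respectively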

2 Clustering and Classification of Otoliths 


In this section we consider the categorization of otoliths for grouping and identi- 
fication of fish species. Otoliths are concretions of calcium carbonate and other 
inorganic salts that are formed by aggregation on a protein matrix in the inner 
ear of vertebrates [6]. The dataset studied consists of 240 high-contrast images 
of otoliths for three different families of fish: labridae (125 images), soleidae (70 
images), and scombridae (45 images). The images are centered and oriented so
that the frontal part of the otolith appears to the right of the image. This set
has been retrieved from the AFORO database (http://www.icm.csic.es/aforo/),
which is an extensive open online repository of data for different fish species.
Otoliths in the labridae family are cuneiform, oval, bullet-shaped, or rectangular.
They present a cleavage in the frontal zone, which, in general, is more prominent
than in the other fish families. Soleidae otoliths are mainly discoidal and
elliptic. Their shapes are in general more regular and smooth than in the other two
families. Finally, otoliths of the scombridae family have serrate contours and
generally are more elongated [19]. Examples of these otoliths are displayed in Fig. 3.
Labridae and soleidae otoliths present higher shape variability than scombridae
otoliths, which are typically more regular. In each of the images, the contour of
the otolith is retrieved using the marching squares algorithm [9] and later
rectified so that all figures have the same number of vertices. In the experiments,
64 vertices are considered. This quantization of the contour reduces
the variability in the representation while preserving a sufficient amount
of detail for an accurate characterization of the shape of the object. The figure
is then characterized by sampling a total of $N_{sample} = 1000$ points at regularly
spaced intervals along the contour. For each of these sampling points the
direction of the unit normal vector is computed to obtain a sample of the
directional variable. For planar figures, it can be represented as the angle that
specifies the direction of this normal vector. The probability density of these
direction values is then approximated using either a scaled histogram with 64
bins or a KDE estimate that utilizes von Mises kernels of width h = en. In both
cases, the probability density estimates are discretized at N = 64 points located
at the center of the histogram bins. From the probability density estimates, two
different characterizations will be used. In a first characterization, the figures
are aligned at the maximum of the corresponding density estimates. The vector
of attributes in this aligned representation consists of the N = 64 values of the
corresponding histogram or KDE. A second characterization (labeled distances)
is a vector composed of the 6 distances between the corresponding density
estimate and the uniform distribution, according to Eq. (4). Both unsupervised
(clustering) and supervised (classification) learning tasks are considered.
In a first set of experiments, the K-means algorithm is used to group the otoliths
into K = 3 clusters. The results of this unsupervised learning task are displayed
in Table 2. Several conclusions can be drawn from these results. The first one
is that the clusters identified when the characterization based on distances to
the uniform circular distribution is used are rather impure. The results are
significantly better when the representation based on alignment is used, especially
when the kernel density estimates (KDE) are employed. Alignment is useful for
these data because otoliths typically have an oblong shape, with a clearly
defined axis. In a second set of experiments various k-nearest neighbors (k-NN)
models are used to predict the category of the figure based on the different
characterizations proposed. First, the rotationally invariant distance between shapes
given by Eq. (3) is employed in the nearest-neighbors algorithm. The distances are
computed using one of the six different metrics described in the previous section.
The number of neighbors is determined using 3-fold cross-validation within the
training data. The range of values explored is k = 3, 5, ..., 13. In most cases
the values selected are either k = 3 or k = 5. The results are not particularly
sensitive to the choice of this parameter. The generalization error is estimated
using 10-fold cross-validation over the complete data set. The results of these
experiments are summarized in Table 3. In all cases, the predictions are very
accurate for all metrics, especially when the smoother kernel density estimation
is used. However, because of the minimization step required in computing the
distances of Eq. (3), the computational cost of this algorithm is high.
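The encoding steps described above (sampling the contour, computing normal
directions, and estimating their density) can be sketched as follows. This is only an
illustration under assumptions: the contour is taken to be a closed polygon with
counter-clockwise vertices, the unit square replaces a real otolith contour, and the
concentration value `nu = 20.0` is a hypothetical choice, not the one used in the
experiments.

```python
import numpy as np

def normal_angles(vertices, n_samples=1000):
    """Angles of the outward normals at points spaced regularly along a closed polygon."""
    v = np.asarray(vertices, dtype=float)
    edges = np.roll(v, -1, axis=0) - v                  # edge i goes from vertex i to i + 1
    cum = np.cumsum(np.hypot(edges[:, 0], edges[:, 1]))
    s = (np.arange(n_samples) + 0.5) * cum[-1] / n_samples  # arc-length positions
    t = edges[np.searchsorted(cum, s)]                  # tangent of the edge holding each point
    # Outward normal of a counter-clockwise polygon: tangent rotated by -90 degrees.
    return np.arctan2(-t[:, 0], t[:, 1])

def scaled_histogram(angles, n_bins=64):
    """Histogram on [-pi, pi] scaled so that the area under it is one."""
    density, bin_edges = np.histogram(angles, bins=n_bins,
                                      range=(-np.pi, np.pi), density=True)
    centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
    return centers, density

def von_mises_kde(angles, grid, nu):
    """Kernel density estimate with von Mises kernels of concentration nu."""
    diff = grid[:, None] - angles[None, :]
    kernel = np.exp(nu * np.cos(diff)) / (2.0 * np.pi * np.i0(nu))
    return kernel.mean(axis=1)

square = [(0, 0), (1, 0), (1, 1), (0, 1)]               # counter-clockwise unit square
angles = normal_angles(square)
centers, hist = scaled_histogram(angles)
kde = von_mises_kde(angles, centers, nu=20.0)           # nu is a hypothetical value
```

For the square, the sampled normals take only four directions, so both estimates
concentrate their mass around four angles separated by π/2.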

In a last set of experiments, as a solution with reduced computational cost,
the aligned and distances representations are employed. For the distances
representation, neighboring instances are identified using the Euclidean distance
between the corresponding vectors of attributes. The aligned version is used to
approximate the rotationally invariant metrics by aligning the discretizations
according to their maximum values. The generalization error and the optimal number
of neighbors are selected by cross-validation as in the previous experiment. The
confusion matrices and cross-validation error estimates for these experiments are
displayed in
[Figure 3 panels: Labridae, Soleidae, Scombridae]


Fig. 3. Examples of labridae, soleidae, and scombridae otoliths. The frontal part of the 
otolith, which corresponds to the head of the fish, appears on the right of the image. 




Fig. 4. In the top row of this figure the contours of otoliths of different fish families 
are shown: labridae (left), soleidae (middle), and scombridae (right). The histogram 
and kernel density estimates of the directional variables for these figures are displayed 
in the middle and bottom rows, respectively. 


Table 4 for the aligned representation and in Table 5 for the distances
representation. As in the clustering experiment, the accuracy is better when aligned
representations are used. The best overall accuracy is obtained when kernel density
estimates are used, also when instances are characterized by the distances
representation. As in the previous experiments, the scombridae and soleidae otoliths
are well separated. It is apparent from Fig. 4 that the shapes of soleidae and
scombridae otoliths are markedly different from each other. As a result, the
discrimination between items from
Table 2. Number of otoliths from the labridae (lab), soleidae (sol), and scombridae
(sco) assigned to each of the 3 clusters identified using K-means. The results on the
left-hand side correspond to the 6-dimensional distances representations. The results
obtained when the 64-dimensional aligned representations are used are displayed on the
right-hand side. The error indicates the proportion of incorrectly grouped instances.

    | Distances                     | Aligned
    | Histogram     | KDE           | Histogram     | KDE
    | C1 | C2 | C3  | C1 | C2 | C3  | C1  | C2 | C3 | C1  | C2 | C3
lab | 86 | 20 | 19  | 93 | 11 | 21  | 113 | 7  | 5  | 113 | 6  | 6
sol | 7  | 63 | 0   | 4  | 66 | 0   | 3   | 67 | 0  | 0   | 70 | 0
sco | 6  | 0  | 39  | 4  | 0  | 41  | 4   | 0  | 41 | 3   | 0  | 42
Err | 0.22          | 0.17          | 0.08          | 0.06


Table 3. 10-fold cross-validation estimates of the confusion matrices for otolith clas- 
sification. The instances, which are characterized by either the histogram or a kernel 
estimate (KDE) of the probability density function of the normal vectors along the 
contour of the figure, are categorized as labridae (lab), soleidae (sol), and scombridae 
(sco) using k-NN with rotationally invariant metrics. The error indicates the propor- 
tion of incorrectly grouped instances. 


Histogram | KDE Histogram | KDE 

lab | sol} sco | lab | sol | sco lab | sol | sco | lab | sol | sco 
lab Li 119/0 |6 1123/0 [2 |EMD}120)1 [4 |124/0 1 
sol 2 |68/0 JO ,7010 3 1670 JO |70 0 
sco 8 JO 137 |5 (0 |40 7 |O |38 |4 |O 41 
Err 0.05 + 0.04 | 0.05 + 0.04 0.11 + 0.06 | 0.10 + 0.06 
lab | Lə | 121 4 |124 1 |H 119 6 | 123 
sol 3.1670 JO 70.0 2 168 0 |0 [70 0 
sco 9 JO 136 |5 0 |40 7 JO 138 |5 JO | 40 
Err 0.05 + 0.04 | 0.04 + 0.04 0.06 + 0.05 | 0.06 + 0.05 
lab Lo 1190 |6 [130 |2 |x? 118/1 |6 [121/2 2 
sol 2 |68/0 ¡0 70 2 1680 JO |70 
sco 6 |0 139 |5 0 |40 7 JO 138 |6 JO 39 
Err 0.05 + 0.04 | 0.05 + 0.04 0.06 + 0.04 | 0.06 + 0.04 


these classes is fairly easy. In fact, they are well separated in all the
representations considered. This is not the case for the labridae otoliths. Some otoliths
from this class present elongated shapes, which are more characteristic of
scombridae. Others present circular, more regular shapes, and can be mistaken for
instances from the soleidae class.
Table 4. 10-fold cross-validation estimates of the confusion matrices for otolith
classification. The instances, which are characterized by aligned representations of
either the histogram or a kernel estimate (KDE) of the probability density function of
the normal vectors along the contour of the figure, are categorized as labridae (lab),
soleidae (sol), and scombridae (sco) using k-NN. The error indicates the proportion of
incorrectly grouped instances.


Histogram |KDE Histogram |KDE 

lab | sol | sco | lab | sol | sco lab | sol | sco | lab | sol | sco 
lab Li 116/7 ‚2 1121/4 [0 |EMD/113 10/2 |122/3 |0 
sol 7 |62/1 |2 6810 9 61.0 |3 [67 0 
sco 7 JO |38 |8 0 |37 6 0 139 8 JO [37 
Err 0.10 + 0.04 |0.06 + 0.05 0.11 + 0.06 |0.06 + 0.03 
lab Lə 117 2 |122 0 |H 116 3 |120 0 
sol 10 |59 4 66/10 10 60/0 |2 /|68/0 
sco 10 [0 |35 |7 (0 |38 9 O [36 |7 JO |38 
Err 0.12 + 0.06 | 0.06 + 0.04 0.12 + 0.06 | 0.06 + 0.07 
lab Lo 116 121 1 |x? 114 3 | 120 0 
sol 8,620 |3 67 0 10 59/11 |3 |67/0 
sco 11 [0 |34 |7 (0 |38 10 0 135 |8 |0 [37 
Err 0.12 + 0.05 | 0.06 + 0.05 0.13 + 0.06 | 0.07 + 0.04 


Table 5. Confusion matrices and 10-fold cross-validation errors for the classification
of labridae (lab), soleidae (sol), and scombridae (sco) otoliths using a characterization
based on distances to the uniform circular distribution (distances). The error indicates
the proportion of incorrectly grouped instances.

    | Histogram       | KDE
    | lab | sol | sco | lab | sol | sco
lab | 98  | 5   | 22  | 110 | 7   | 8
sol | 18  | 52  | 0   | 10  | 60  | 0
sco | 11  | 0   | 34  | 15  | 0   | 30


3 Conclusions 


In this work we propose a characterization of shapes of objects based on direc- 
tional data. The ultimate goal is to categorize these objects according to their 
shapes. Intuitively, shape can be defined as the geometrical property shared 
by different objects that is invariant to a loosely-defined family of transforma- 
tions, which includes translations, scaling, rotations, and small deformations. 
In our method, shape is encoded by the distribution of unit vectors along the 
normal direction at the boundaries of the object. In this manner, information 
on scale and absolute distances is eliminated while preserving directional
information, which is expected to encode shape. This representation is obtained by
first locating these boundaries with standard image processing algorithms. Then,
normal unit vectors are computed at a set of points located on this boundary.
These vectors can be seen as realizations of a directional random variable. This
random variable can be characterized by its distribution. In two dimensions, the
boundary is a curve, so the distribution of normal vectors is defined on the circle.
Empirical estimates of the probability density function can be computed using
a scaled histogram, or a smoother kernel density estimation using, for instance,
von Mises kernels. In this work both options are explored. For three-dimensional
representations, the boundary of the object is a surface. The distribution of
three-dimensional unit normal vectors is defined on a sphere. In this case von
Mises-Fisher kernels can be used [4]. Since the representation is functional (and
therefore, infinite dimensional), further reductions of information are needed so
that it can be used in practice. In particular, for two-dimensional data, the
probability densities are discretized at N = 64 regularly spaced points along the
circle. Since shape should be invariant with respect to rotations, computationally
expensive shape-alignment operations are needed when these representations are
used for clustering or classification. A further dimensionality reduction, which
is rotationally invariant, can be made using as features the distances between the
discretized densities and the uniform circular distribution. The usefulness of these
representations for clustering and classification of images of objects according
to their shapes is illustrated with otolith data. Techniques based on exhaustive
minimization, while being computationally costly, are very accurate and provide
the best overall results, especially when kernel estimates of the density of
directional variables are used. From these experiments we conclude that the distances
representation provides a significant reduction in the computational cost, at the
expense of the quality of the results. The aligned representation provides a good
balance between performance and computational cost. This empirical investigation
illustrates the effectiveness of the directional data representation proposed
to encode shapes.


References 


1. Berrendero, J.R., Cuevas, A., Pateiro-López, B.: Shape classification based on
interpoint distance distributions. J. Multivariate Anal. 146, 237-247 (2016). Special
Issue on Statistical Models and Methods for High or Infinite Dimensional Spaces

2. Chaubey, Y.P.: Smooth kernel estimation of a circular density function: a
connection to orthogonal polynomials on the unit circle. J. Probab. Stat. 2018, 4 p.
(2018). https://doi.org/10.1155/2018/5372803. Article ID 5372803

3. Di Marzio, M., Panzera, A., Taylor, C.: A note on density estimation for circular
data (2012)

4. Fisher, R.: Dispersion on a sphere. Proc. Roy. Soc. Lond. Ser. A: Math. Phys. Sci.
217(1130), 295-305 (1953)

5. García-Portugués, E.: Exact risk improvement of bandwidth selectors for kernel
density estimation with directional data. Electron. J. Statist. 7, 1655-1685 (2013).
https://doi.org/10.1214/13-EJS821



Directional Data Analysis for Shape Classification 607 


6. Giménez, J., Manjabacas, A., Tuset, V.M., Lombarte, A.: Relationships between
otolith and fish size from Mediterranean and Northeastern Atlantic species to be
used in predator-prey studies. J. Fish Biol. 89(4), 2195-2202 (2016)

7. Grogan, M., Dahyot, R.: Shape registration with directional data. Pattern Recogn.
79, 452-466 (2018)

8. Kendall, D.G.: A survey of the statistical theory of shape. Statist. Sci. 4(2), 87-99
(1989). https://doi.org/10.1214/ss/1177012582

9. Lorensen, W.E., Cline, H.E.: Marching cubes: a high resolution 3D surface
construction algorithm. Comput. Graph. 21(4), 163-169 (1987)

10. MacLeod, N.: Geometric morphometrics and geological shape-classification
systems. Earth-Sci. Rev. 59(1), 27-47 (2002)

11. Mardia, K.V., Jupp, P.: Directional Statistics. Wiley Series in Probability and
Statistics. Wiley, New York (2009)

12. Gavrielides, M.A., Kallergi, M., Clarke, L.P.: Automatic shape analysis and
classification of mammographic calcifications (1997). https://doi.org/10.1117/12.274175

13. Montero-Manso, P., Vilar, J.: Shape classification through functional data
reparametrization and distribution-based comparison (2017)

14. Moyou, M., Ihou, K.E., Peter, A.M.: LBO-Shape densities: a unified framework for
2D and 3D shape classification on the hypersphere of wavelet densities. Comput.
Vis. Image Underst. 152, 142-154 (2016)

15. Mu, T., Nandi, A.K., Rangayyan, R.M.: Classification of breast masses using
selected shape, edge-sharpness, and texture features with linear and kernel-based
classifiers. J. Digit. Imaging 21(2), 153-169 (2008)

16. Pizer, S.M., Thall, A.L., Chen, D.T.: M-Reps: a new object representation for
graphics. Technical report, ACM Transactions on Graphics (2000)

17. Ramsay, J., Silverman, B.: Functional data analysis (1997)

18. Taylor, C.C.: Automatic bandwidth selection for circular density estimation.
Comput. Stat. Data Anal. 52(7), 3493-3500 (2008)

19. Tuset, V., Lombarte, A., Assis, C.: Otolith atlas for the Western Mediterranean,
North and Central Eastern Atlantic. Scientia Marina 72(Suppl. 1), 7-198 (2008)

20. Yushkevich, P., Pizer, S.M., Joshi, S., Marron, J.S.: Intuitive, localized analysis of
shape variability. In: Insana, M.F., Leahy, R.M. (eds.) IPMI 2001. LNCS, vol. 2082,
pp. 402-408. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45729-
LAL




Semantic Space Transformations for 
Cross-Lingual Document Classification 


Jiří Martinek¹, Ladislav Lenc², and Pavel Kral²


1 Department of Computer Science and Engineering, Faculty of Applied Sciences,
University of West Bohemia, Plzeň, Czech Republic
{jimar,pkral}@kiv.zcu.cz 
2 NTIS - New Technologies for the Information Society, Faculty of Applied Sciences, 
University of West Bohemia, Plzeň, Czech Republic 
llenc@kiv.zcu.cz 


Abstract. Cross-lingual document representation can be done by training
monolingual semantic spaces and then using bilingual dictionaries
with some transform method to project word vectors into a unified space.
The main goal of this paper is the evaluation of three promising
transform methods on the cross-lingual document classification task. We also
propose, evaluate and compare two cross-lingual document classification
approaches. We use a popular convolutional neural network (CNN) and
compare its performance with a standard maximum entropy classifier.
The proposed methods are evaluated on four languages, namely English,
German, Spanish and Italian from the Reuters corpus. We demonstrate
that the results of all transformation methods are close to each other;
however, the orthogonal transformation generally gives slightly better
results when the CNN with trained embeddings is used. The experimental
results also show that the convolutional network achieves better results than
the maximum entropy classifier. We further show that the proposed methods
are competitive with the state of the art.


1 Introduction 


The performance of many Natural Language Processing (NLP) systems is 
strongly dependent on the size and quality of annotated resources. Unfortunately, 
there is a lack of annotated data for particular languages and tasks, and manual 
annotation of new corpora is very expensive and time-consuming. Moreover, 
linguistic experts from the target domain are often required. These issues 
can be addressed by cross-lingual text representation methods: the 
classifiers are trained on resource-rich languages, and the cross-lingual 
representation allows the models to be used on data in other languages for 
which no training data are available. 
The text document representations are often created using multi-dimensional 
word vectors, the so-called word embeddings (Levy and Goldberg [10]). One 
way of creating cross-lingual representations is to use transformed semantic 
spaces. Such approaches take a monolingual, independently trained semantic 
space and project it into a unified space using some transformation method. 


© Springer Nature Switzerland AG 2018 
V. Kůrková et al. (Eds.): ICANN 2018, LNCS 11139, pp. 608-616, 2018. 

Several such transformation methods have been proposed. However, to the 
best of our knowledge, a comparative study of the role of different 
transformation methods and classifiers for document classification across 
several languages is missing. Therefore, the main contribution of this paper 
is a thorough study of the impact of three promising transformation methods, 
namely Least Squares Transformation (LST), Orthogonal Transformation (OT) and 
Canonical Correlation Analysis (CCA), on cross-lingual document classification. 
More information about linear transformations for building cross-lingual 
semantic spaces can be found in [2,3]. In this context, we propose, evaluate 
and compare two cross-lingual document classification approaches. The first 
one directly uses the transformed embeddings in different languages, while the 
second one performs a simple word translation by choosing the closest word 
according to the cosine similarity of the embedding vectors. 
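
As an illustration of the first two transformations, the following minimal sketch assumes that a bilingual dictionary yields two row-aligned matrices, X (source-language vectors) and Y (target-language vectors). The function names are hypothetical, and the closed forms shown are the textbook solutions (ordinary least squares and orthogonal Procrustes); they may differ in detail from the implementations described in [2,3]. CCA is omitted.

```python
import numpy as np


def least_squares_transform(X, Y):
    """Return W minimizing ||X W - Y||_F (ordinary least squares)."""
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W


def orthogonal_transform(X, Y):
    """Return the orthogonal W minimizing ||X W - Y||_F (Procrustes)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy data (assumption): the target space is a rotated copy of the source space.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))                  # 50 dictionary pairs, dimension 5
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # hidden orthogonal map
Y = X @ Q

W_lst = least_squares_transform(X, Y)
W_ot = orthogonal_transform(X, Y)
```

On this noise-free toy problem both solutions recover the hidden map; on real dictionaries they generally differ, because OT is constrained to preserve vector lengths and angles while LST is not.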

For classification, we use a popular convolutional neural network (CNN) and 
compare its performance with a standard maximum entropy classifier. The 
proposed methods are evaluated on four languages, namely English, German, 
Spanish and Italian, from the Reuters corpus. 


2 Literature Review 


Recent work in the field of cross-lingual text representation is usually based 
on word-level alignments. Klementiev et al. [7] simultaneously train two 
language models based on neural networks. The proposed method uses a 
regularization which ensures that pairs of frequently aligned words have 
similar word embeddings. Therefore, this approach needs parallel corpora to 
obtain the word-level alignment. Zou et al. [13] propose an alternative 
approach based on different neural network language models and a different 
regularization. 

Kočiský et al. [8] propose a bilingual word representation approach based 
on a probabilistic model. This method simultaneously learns alignments and 
distributed representations for bilingual data. Contrary to the prior work, which 
is based on parallel corpora or hard alignment, this method marginalizes out the 
alignments and thus captures a larger bilingual semantic context. 

Chandar et al. [4] investigate an efficient approach based on autoencoders 
that uses word representations coherent between two languages. This method is 
able to obtain high-quality text representations by learning to reconstruct the 
bag-of-words of aligned sentences without any word alignments. 

Coulmance et al. [5] introduce an efficient method for bilingual word 
representations called Trans-gram. This approach extends the popular skip-gram 
model to the multilingual scenario. The model jointly learns and aligns word 
embeddings for several languages, using only monolingual data and a small set 
of sentence-aligned documents. 


3 Cross-Lingual Document Classification 


3.1 Document Representation 


We use three document representations in our experiments. The first one is the 
Bag-of-Words (BoW). The second approach, called averaged embeddings, utilizes 
word embeddings. It averages the word vectors of all words occurring in the 
document, so its length corresponds to the embedding dimensionality. The last 
method uses the sequence of words in the document and transforms it into a 
2D representation suitable for the CNN. The words are one-hot encoded and 
mapped to the corresponding embeddings by a look-up table. Below we 
describe the three ways in which we achieve cross-linguality in our 
classification methods. 
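
The averaged-embeddings representation can be sketched as follows. The function name and the toy vectors are hypothetical, and skipping words without an embedding (with a zero vector as fallback) is an assumption, since the text does not state how such words are handled.

```python
import numpy as np


def average_embedding(tokens, embeddings, dim):
    """Average the vectors of all tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:                      # no known word: fall back to zeros
        return np.zeros(dim)
    return np.mean(vectors, axis=0)


# Toy two-dimensional embeddings (assumption, for illustration only).
toy = {"oil": np.array([1.0, 0.0]), "price": np.array([0.0, 1.0])}
doc_vector = average_embedding(["oil", "price", "unknownword"], toy, dim=2)
```

The resulting vector has the embedding dimensionality regardless of the document length, which is what allows it to be fed to a classifier with a fixed number of input features.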


Machine Translation. Machine translation (MT) is used as a strong 
baseline for comparison with the two other methods. The documents are translated 
using Google API. The translated documents are then classified in the same way 
as documents in a single language. 


Transformed Embeddings. This approach relies on the transformed word 
embeddings. The representations of the training documents are created from 
the original word embeddings in the language that was used for training 
the model. The documents in the testing dataset are then represented by 
the embeddings transformed to the language of the model. This method will be 
hereafter called transformed (emb)eddings. 


Embedding Translation. This method is also based on the transformed 
embeddings. However, the embeddings are used for per-word translation of the 
documents instead of being used directly. It utilizes the non-transformed 
embeddings in the target language and the embeddings transformed from the 
source to the target language for similarity search. For each word in the 
source language, the most similar word in the target language is found by 
cosine similarity. This method is referred to below as the (emb)edding 
translation approach. 
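
A minimal sketch of the per-word similarity search is given below. The function name, the toy vocabulary and the toy vectors are assumptions made for illustration; only the use of cosine similarity comes from the text.

```python
import numpy as np


def nearest_target_word(src_vec, tgt_words, tgt_vecs):
    """Return the target word whose vector has the highest cosine similarity."""
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return tgt_words[int(np.argmax(tgt @ src))]


tgt_words = ["Haus", "Hund"]
tgt_vecs = np.array([[1.0, 0.1], [0.1, 1.0]])   # non-transformed target space
src_vec = np.array([0.2, 0.9])                  # transformed source-word vector
translation = nearest_target_word(src_vec, tgt_words, tgt_vecs)
```

Normalizing both sides to unit length turns the dot product into the cosine similarity, so the search reduces to one matrix-vector product per source word.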


3.2 Classification Models 


Maximum Entropy. The first classifier is the Maximum Entropy (ME) model 
(Berger et al. [1]). For each document, it takes an input with a fixed number of 
features, represented as a feature vector F, and outputs the probability 
distribution P(Y = y|F), where y ∈ C (the set of all possible document classes). 
This model is popular in the natural language processing field because it 
usually gives good classification scores. 
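
A maximum entropy classifier over a feature vector has the same form as multinomial logistic regression, so the distribution P(Y = y|F) can be sketched as a softmax over linear scores. The weight matrix W, the bias b and the toy values below are hypothetical, and parameter estimation is omitted.

```python
import numpy as np


def maxent_probabilities(F, W, b):
    """P(Y = y | F) for every class y as a softmax over linear scores."""
    scores = W @ F + b
    scores = scores - scores.max()       # subtract the maximum for stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()


F = np.array([1.0, 0.0, 2.0])            # toy feature vector
W = np.zeros((4, 3))                     # one weight row per class
W[2] = [0.5, 0.0, 1.0]                   # only the third class reacts to F
probs = maxent_probabilities(F, W, np.zeros(4))
```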


Convolutional Neural Network. The second classifier is a popular 
Convolutional Neural Network (CNN). It also outputs normalized scores 
interpreted as a probability distribution P(Y = y|F) over all possible labels. 
The network we use was proposed by Lenc and Král [9], and it was successfully 
used for multi-label classification of Czech documents. The architecture of the 
network is inspired by Kim [6]. The main difference from Kim’s network is that 
this net uses a different number and size of convolutional kernels. 

We perform a basic preprocessing which detects all numbers and replaces 
them with a single “NUMERIC” token. Then the document length is adjusted to 
a fixed value: longer documents are shortened while shorter ones are padded 
so that all documents have the fixed length L. A vocabulary of the most 
frequent words is prepared from the training data. The words are then 
represented by their indexes in the vocabulary. The words that are not in the 
vocabulary are assigned to a reserved index (“OOV”), and the “PADDING” token 
also has a reserved index. 
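
The preprocessing steps above can be sketched as follows. The concrete values of the reserved indexes, the regular expression that decides what counts as a number, and the choice to truncate and pad at the end of the document are assumptions; the text does not specify them.

```python
import re
from collections import Counter

PAD_INDEX, OOV_INDEX = 0, 1                    # reserved indexes (assumed values)
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")        # assumed notion of a "number"


def normalize(tokens):
    """Replace every numeric token by the single NUMERIC token."""
    return ["NUMERIC" if NUMBER.fullmatch(t) else t for t in tokens]


def build_vocabulary(documents, size):
    """Index the `size` most frequent training words, starting at index 2."""
    counts = Counter(t for doc in documents for t in normalize(doc))
    return {w: i + 2 for i, (w, _) in enumerate(counts.most_common(size))}


def encode(tokens, vocabulary, length):
    """Turn a document into exactly `length` word indexes."""
    ids = [vocabulary.get(t, OOV_INDEX) for t in normalize(tokens)]
    ids = ids[:length]                                   # shorten long documents
    return ids + [PAD_INDEX] * (length - len(ids))       # pad short documents


train_docs = [["profit", "rose", "3.5"], ["profit", "fell", "12"]]
vocabulary = build_vocabulary(train_docs, size=2)
encoded = encode(["profit", "jumped", "7"], vocabulary, length=5)
```

In the toy example, "jumped" is outside the two-word vocabulary and is mapped to the OOV index, "7" is mapped to the index of NUMERIC, and two padding indexes complete the fixed length.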

The input of the network is a vector of word indexes of length L, where 
L is the number of words used for document representation. The second layer 
is an embedding layer which represents each input word as a vector of a given 
length. The document is thus represented as a matrix with L rows and E columns, 
where E is the embedding dimensionality. The embedding layer can either be 
initialized randomly and trained during the network training process, or use 
pre-trained word embeddings as its weights. The third layer is the convolutional 
one. N convolutional kernels of size K × 1 are used, which means that a 1D 
convolution over one position in the embedding vector across K input words is 
performed. The following layer performs max pooling over the length L − K + 1, 
resulting in N vectors of size 1 × E. This layer is followed by a dropout layer 
(Srivastava et al. [12]) for regularization. The output of this layer is then 
flattened and connected to a fully connected layer with D neurons. Another 
dropout layer is followed by the output layer with C neurons, where C 
corresponds to the number of document categories. The architecture of the 
network is depicted in Fig. 1. 
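
To make the tensor shapes concrete, the following sketch computes the convolution and max-pooling stage with plain NumPy. It is an illustration under stated assumptions rather than the authors' implementation: biases are omitted, the weights are random, ReLU is applied before pooling, and the sizes L = 100, E = 300, K = 16 and N = 40 are the ones reported for the CNN experiments in Sect. 4.3.

```python
import numpy as np


def conv_relu_maxpool(doc, kernels):
    """doc: (L, E) embedded document; kernels: (N, K) kernel weights.

    Each K x 1 kernel slides over the word axis separately for every
    embedding dimension; max pooling then keeps one value per dimension.
    """
    L, _ = doc.shape
    K = kernels.shape[1]
    # positions[p] has shape (N, E): all kernels applied to words p .. p+K-1
    positions = [kernels @ doc[p:p + K] for p in range(L - K + 1)]
    conv = np.maximum(np.stack(positions, axis=1), 0.0)   # (N, L-K+1, E), ReLU
    return conv.max(axis=1)                               # (N, E)


L, E, K, N = 100, 300, 16, 40                  # sizes from the CNN experiments
rng = np.random.default_rng(0)
pooled = conv_relu_maxpool(rng.normal(size=(L, E)), rng.normal(size=(N, K)))
flattened = pooled.reshape(-1)                 # input of the dense layer
```

With these sizes the convolution yields L − K + 1 = 85 positions per kernel, pooling reduces them to N = 40 vectors of length E = 300, and flattening gives 40 × 300 = 12,000 inputs for the fully connected layer.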


4 Experiments 


4.1 Reuters Corpus Volume I 


We use four languages, namely English (en), German (de), Spanish (es) and 
Italian (it), from Reuters Corpus Volume I (RCV1-v2) (Lewis et al. [11]) with 
a setup similar to that used by Klementiev et al. [7]. The documents are 
classified into the following four categories: Corporate/industrial - CCAT, 
Economics - ECAT, Government/social - GCAT and Markets - MCAT. 

As in other studies, we use the standard accuracy metric in our experiments. 
The confidence interval is ±0.3% at the confidence level of 0.95. 


4.2 Baseline Approaches Results 


Fig. 1. Convolutional neural network architecture. 


Our first baseline method is a majority class (MC) classifier, which determines 
the distribution of categories in the training dataset and chooses the most 
frequent class. In the testing phase, all test documents are assigned to this 
most frequent class. The accuracy of this classifier is given in the third 
column of Table 1. These results show that the corpus is unbalanced and that 
there are significant differences among the languages. 
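
The majority class baseline can be sketched in a few lines. The function name and the toy label lists are hypothetical; the category codes are those of Sect. 4.1.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy obtained by always predicting the most frequent training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    hits = sum(label == majority for label in test_labels)
    return hits / len(test_labels)


train = ["CCAT", "CCAT", "CCAT", "GCAT", "MCAT"]   # toy training labels
test = ["CCAT", "ECAT", "GCAT", "CCAT"]            # toy test labels
accuracy = majority_class_accuracy(train, test)
```

Because the majority class is taken from the training language and evaluated on the test language, the score drops whenever the two class distributions differ, which is one possible reading of the large differences among language pairs.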

The second baseline is the machine translation (MT) approach. The results 
with the ME classifier are reported in Table 1, while the accuracy of the CNN is 
shown in Table 2 (column MT). Classification accuracies of this approach are very 
high and show that the translation results have a strong impact on document 
classification. 


4.3 Proposed Approaches Results 


The embedding translation approach requires repeated searches of the target 
semantic space, which is computationally demanding. In order to reduce the 
computational burden, we set the vocabulary size |V| = 20,000. The vocabulary 
is constructed from the most frequent words in the training set. To increase 
the efficiency of the search, we created a vocabulary mapping dictionary 
between each pair of languages. It contains a mapping onto the target language 
vocabulary for each word in the source language. This dictionary is the 
centerpiece of the embedding translation. If the source word is not present in 
the vocabulary, the out-of-vocabulary token (“OOV”) is used. Each proposed 
method is experimentally validated on two classification models. 
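
The vocabulary mapping dictionary can be precomputed in one batch, as in the sketch below. The function names and the toy vocabularies are assumptions; the only elements taken from the text are the cosine-similarity search and the OOV fallback.

```python
import numpy as np


def unit_rows(M):
    """Scale every row of M to unit length."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)


def build_mapping(src_words, src_vecs, tgt_words, tgt_vecs):
    """Map every source word to its most cosine-similar target word."""
    similarities = unit_rows(src_vecs) @ unit_rows(tgt_vecs).T
    nearest = similarities.argmax(axis=1)
    return {w: tgt_words[j] for w, j in zip(src_words, nearest)}


def translate_document(tokens, mapping, oov="OOV"):
    """Replace each token by its mapped word, or by the OOV token."""
    return [mapping.get(t, oov) for t in tokens]


mapping = build_mapping(
    ["house", "dog"], np.array([[0.9, 0.2], [0.1, 0.8]]),    # transformed source
    ["Hund", "Haus"], np.array([[0.0, 1.0], [1.0, 0.0]]),    # target space
)
translated = translate_document(["dog", "house", "zebra"], mapping)
```

Computing all similarities as a single matrix product means the search is done once per language pair; translating a document afterwards is a plain dictionary look-up.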


Maximum Entropy Results. The last six columns in Table 1 show the 
results of the maximum entropy classifier with the transformed (emb)eddings and 
(emb)edding translation methods. Three linear transformations are used. 


Table 1. ME classifier results. Columns 3 and 4 represent the majority class (MC) 
and machine translation (MT) baselines. The rest of the table shows results for the 
proposed methods with different embedding transformations. 


Languages    | Baselines [%] | Proposed approaches [%]
             |               | Transformed emb    | Emb translation
Train | Test | MC    | MT    | LST  | CCA  | OT   | LST  | CCA  | OT
en    | de   | 30.4  | 91.9  | 54.4 | 62.0 | 52.2 | 75.6 | 75.8 | 76.3
en    | es   | 14.7  | 81.5  | 52.9 | 39.0 | 49.9 | 57.2 | 48.8 | 49.8
en    | it   | 36.0  | 71.2  | 48.4 | 42.2 | 56.0 | 58.1 | 51.0 | 51.5
de    | en   | 23.9  | 76.7  | 59.9 | 60.4 | 57.4 | 66.6 | 69.0 | 70.2
de    | es   | 8.76  | 81.1  | 35.9 | 32.5 | 29.4 | 73.2 | 58.5 | 63.8
de    | it   | 9.50  | 67.0  | 58.7 | 57.5 | 57.0 | 47.2 | 47.7 | 46.4
es    | en   | 23.3  | 74.3  | 47.8 | 53.2 | 45.4 | 67.3 | 70.5 | 69.6
es    | de   | 22.6  | 85.7  | 51.7 | 43.8 | 47.6 | 74.5 | 70.1 | 70.3
es    | it   | 36.4  | 67.9  | 22.8 | 19.8 | 29.7 | 71.2 | 72.0 | 72.1
it    | en   | 23.3  | 69.7  | 68.5 | 69.8 | 63.8 | 56.8 | 55.6 | 50.2
it    | de   | 22.6  | 86.9  | 48.7 | 45.1 | 52.6 | 76.6 | 77.4 | 76.8
it    | es   | 67.7  | 80.8  | 61.4 | 49.4 | 59.5 | 75.3 | 75.2 | 70.5


This table shows that grammatically close languages (same family) usually 
give better results than the other ones. More concretely, en → de and es → it 
generally give better results than, for instance, en → es (it) or de → es (it). 

The results further show that the three transformations are comparable in 
many cases. However, LST gives the best results in several other cases (e.g. 
en → es (it) or de → es (it)), and OT gives significantly worse results in 
some cases (e.g. it → en). Based on this experiment, we propose to use LST 
in general with the ME classifier and static word embeddings. 


CNN Results. In all our experiments we use the vocabulary size |V| = 20,000. 
The document length L is set to 100 tokens and the embedding length E is 300 
in all cases. We use N = 40 convolutional kernels of size 16 × 1 (K = 16). The 
dropout probability is set to 0.2. The size of the first fully connected layer is 256. 
The output layer has 4 neurons (C = 4) because we classify into 4 classes. 
All layers except the output one use the relu activation function. The output layer 
uses the softmax activation function. 

The results of the direct usage of the embedding vectors are given in the 
leftmost columns of the Proposed approaches part of Table 2. This method gives 
the worst results among the proposed approaches; however, it is the simplest one. 

The last six columns (Emb translation) of Table 2 show the results of the CNN 
with the (emb)edding translation method. There are two sets of results in 
Table 2. In the first one, the embedding layer was excluded from learning 
(stat), while in the second one the embedding layer is further fine-tuned by 
training (rnd). In the table we can observe that the embedding training has 
a positive impact on classification. Moreover, the impact of the transformation 
differs from the previous case (see Table 1). We suggest using OT as the best 
transformation method when a CNN with trained embeddings is used. 


Table 2. CNN results. Columns 3 and 4 represent the MT baseline. The rest of the 
table presents the results of the proposed methods with different embedding 
transformations. The term stat denotes static word embeddings, while the term 
rnd denotes randomly initialized embeddings with subsequent training. 


Languages    | Baselines [%] | Proposed approaches [%]
             | MT            | Transformed emb    | Emb translation (stat)   | Emb translation (rnd)
Train | Test | rnd   | stat  | LST  | CCA  | OT   | LST    | CCA    | OT     | LST    | CCA    | OT
en    | de   | 89.7  | 86.5  | 62.2 | 64.4 | 56.0 | 78.9   | 79.1   | 81.3   | 80.4   | 80.4   | 82.7
en    | es   | 85.7  | 69.8  | 24.4 | 26.7 | 23.6 | 82.0   | 77.0   | 76.9   | 81.9   | 75.7   | 72.9
en    | it   | 74.7  | 65.8  | 27.7 | 26.9 | 18.0 | 68.1   | 70.2   | 68.6   | 71.1   | 70.5   | 68.3
de    | en   | 61.3  | 59.4  | 66.4 | 64.8 | 58.4 | 69.3   | 70.0   | 70.2   | 72.3   | 75.6   | 75.6
de    | es   | 64.7  | 55.7  | 65.0 | 57.6 | 55.2 | 51.0   | 55.3   | 54.5   | 81.4   | 79.3   | 80.7
de    | it   | 47.8  | 48.8  | 39.1 | 50.9 | 58.4 | 44.8   | 48.6   | 49.2   | 68.7   | 71.1   | 71.2
es    | en   | 60.7  | 67.6  | 49.2 | 41.1 | 41.5 | 51.9   | 54.9   | 55.0   | 59.0   | 63.1   | 63.0
es    | de   | 76.8  | 81.8  | 42.9 | 54.1 | 58.0 | 58.5   | 72.2   | 81.5   | 54.7   | 69.2   | 82.0
es    | it   | 62.4  | 61.9  | 20.2 | 35.5 | 37.5 | 68.0   | 70.9   | 71.5   | 73.0   | 76.2   | 76.7
it    | en   | 69.1  | 65.7  | 42.5 | 44.8 | 46.4 | 41.4   | 41.8   | 41.1   | 54.8   | 54.7   | 51.3
it    | de   | 85.4  | 81.5  | 37.0 | 38.3 | 44.5 | 59.6   | 72.7   | 74.1   | 63.0   | 76.0   | 60.8
it    | es   | 80.1  | 73.8  | 68.5 | 68.8 | 68.1 | 61.3   | 61.3   | 62.1   | 78.6   | 78.6   | 78.7


4.4 Comparison with the State of the Art 


In this experiment, we compare the results of our best approach with the state 
of the art (see Table 3). These results show that the state-of-the-art methods 
slightly outperform the proposed approaches. However, we must emphasize that 
our main goal is the comparison of several different methods. Moreover, 
the proposed approaches are very simple. 


Table 3. Comparison with the state of the art. 


Method                      | en → de [%] | de → en [%]
Klementiev et al. [7]       | 77.6        | 71.1
Kočiský et al. [8]          | 83.1        | 76.0
Chandar et al. [4]          | 91.8        | 74.2
Coulmance et al. [5]        | 91.1        | 78.7
Best proposed configuration | 82.7        | 75.6


5 Conclusions 


This paper presented a thorough study of the impact of three promising 
transformation methods, namely least squares transformation, orthogonal 
transformation and canonical correlation analysis, on cross-lingual document 
classification. In this context, we proposed and evaluated two cross-lingual 
document classification approaches. The first one directly uses the 
transformed embeddings in different languages without any modification, while 
the second one performs a simple word translation by choosing the closest word 
according to the cosine similarity of the embeddings. We compared the 
performance of a standard maximum entropy classifier with our convolutional 
neural network architecture on this task. 

We evaluated the proposed approaches on four languages, namely English, 
German, Spanish and Italian, from the Reuters corpus. We have shown that the 
results of all transformation methods are close to each other; however, the 
orthogonal transformation generally gives slightly better results when a CNN 
with trained embeddings is used. We have also demonstrated that the 
convolutional neural network achieves significantly better results than the 
maximum entropy classifier. We have further shown that the proposed methods 
are competitive with the state of the art. 


Abstract. Compression distances can be a very useful tool in automatic 
object clustering because of their parameter-free nature. However, when 
they are used to compare objects of very different sizes with a high 
percentage of noise, their behaviour might be unpredictable. In order to 
address this drawback, we have developed an automatic object segmentation 
methodology that is applied prior to the string-compression-based object 
clustering. Our experimental results using the xeno-canto database show that 
this methodology can be successfully applied to automatic bird species 
identification from their sounds. These results show that applying our 
methodology significantly improves the clustering performance of bird 
sounds compared to the performance obtained without it. 


Keywords: Data mining · Normalized compression distance · 
Clustering · Dendrogram · Bird sound classification · 
Silhouette coefficient · Similarity · Object segmentation 


1 Introduction 


Automatic bird species classification from their sounds may be extremely 
useful in fields such as ecology, behavioral biology or conservation biology, 
among others. However, from a technical point of view, it can be a challenging 
task due to aspects such as the high variety of existing species, song 
similarities between distinct species or background noise. The xeno-canto 
dataset [13] is a collection of sound recordings of birds from all over the 
world that is available online. This data set has been used in several studies 
on bird song classification in the literature [1,11,12]. In order to improve 
the classification performance, these works usually carry out an extensive 
preliminary analysis, which in most cases comprises, at least, pre-treatment of 
the samples (background noise), feature extraction (analysis and identification 
of the song components) and parameter selection (discrimination process). This 
strong dependence on parameter selection may not be the most appropriate for 
such a heterogeneous data set. Some examples of this heterogeneity are the 
variability in the duration of the data samples, the background noise, or even 
other bird songs that occasionally appear in the audio recordings. 

Compression distances can be a very convenient tool for automatic 
bird species identification from songs because of their parameter-free nature, 
broad applicability and leading efficacy in many domains. Audio [15], 
images [5,6], documents [7,8] and computer security [2] are, among others, 
examples of the transversal application of this methodology. In a preliminary 
study, we applied one of the most successful compression distances, the 
Normalized Compression Distance (NCD), to bird song classification [15]. We 
showed that the NCD can be used as an alternative approach to identify bird 
species from audio samples. That work, however, did not address an issue that 
compression distances have when they are applied to compare two objects of 
very different sizes: two objects can be considered significantly different by 
the NCD even though they are similar [3,9]. Although this might not be a 
problem in scenarios where there are no significant differences in the size of 
the objects to be compared, this is not the case for this database. Here, the 
high heterogeneity in the duration and noise of the data samples demands 
additional considerations. 

Another issue that was not addressed in [15] either, and that typically 
appears in the xeno-canto database, occurs in recordings where the relevant 
information is surrounded by large amounts of noise. This is the case for some 
recordings where the majority of the sample is composed of background noise 
such as microphone artifacts, human voices or even other bird species. Although 
compression distances have a high noise tolerance, they are based on size 
reduction relations between pairs of objects. This makes them robust for 
objects where the information is mixed with the noise, but very weak for data 
samples where the size of the noise exceeds the size of the relevant 
information. 

In this work, we address these issues by reassembling each audio file from a 
selection of its fragments. Our aim was to segment the bird song recordings, 
analyze their NCD matrix and select only the fragments with relevant 
information. Next, we use the “relevant” fragments to assemble new audios 
without noise and, thereby, to achieve a more accurate representation of their 
distances. The results presented in this paper show that applying our 
methodology significantly improves the clustering performance of bird sounds 
compared to the performance obtained without applying our automatic object 
segmentation methodology. 


2 Normalized Compression Distance 


The Normalized Compression Distance (NCD) is a metric that provides a 
measure of similarity between two objects based on the use of compression 
algorithms. The fundamental idea behind the NCD is that, given two objects x 
and y, when a compression algorithm encodes the concatenation xy, it searches 
for the information shared by x and y in order to reduce the redundancy of the 
whole sequence. This concept was studied in [3,10]. The NCD between two 
objects x and y is defined as: 


$$\mathrm{NCD}(x, y) = \frac{\max\{C(xy) - C(x),\; C(yx) - C(y)\}}{\max\{C(x),\; C(y)\}} \tag{1}$$


where C is a compression algorithm, and C(x) and C(xy) are the sizes of the 
C-compressed versions of x and of the concatenation of x and y, respectively. 
Since compression algorithms can be applied to all types of digital data, the 
NCD has been applied in many areas (see Sect. 1). However, the application of 
the NCD involves some difficulties that depend on the context. The size 
differences between objects, the symbolization of the information and a high 
percentage of noise are some examples of these difficulties. The first problem 
can be understood simply by looking at Eq. 1. The NCD is based on the 
assumption that every object x is reducible by the compressor. A typical case 
where this does not hold is when one or both objects have already been 
compressed. In the same fashion, if two similar and reducible objects x and y 
are not reduced proportionally by a compressor, the NCD will be near 1. 
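
Eq. 1 can be evaluated directly with a general-purpose compressor. The sketch below uses Python's standard zlib module; the helper names and the random test data are assumptions made for illustration, and the sketch is not the CompLearn implementation.

```python
import random
import zlib


def compressed_size(data):
    """C(x): size in bytes of the zlib-compressed object."""
    return len(zlib.compress(data, 9))


def ncd(x, y):
    """Normalized Compression Distance of Eq. 1."""
    cx, cy = compressed_size(x), compressed_size(y)
    numerator = max(compressed_size(x + y) - cx, compressed_size(y + x) - cy)
    return numerator / max(cx, cy)


rng = random.Random(0)
a = bytes(rng.getrandbits(8) for _ in range(2000))   # toy object
b = bytes(rng.getrandbits(8) for _ in range(2000))   # unrelated toy object

same = ncd(a, a)        # the second copy is almost free to encode
different = ncd(a, b)   # nothing is shared between the two objects
```

An object compared with itself yields a value close to 0, whereas two unrelated objects of the same size yield a value close to 1.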


2.1 Object Size Problem 


If we consider the case where one of the objects is extremely big and the other 
one is significantly small, the NCD between these objects will be close to 1, 
regardless of the information contained in them. In Fig. 1, one can see that the 
resolution of the NCD domain decreases as the size difference increases. For 
instance, around 20 KB the average NCD is 0.998, while for 1 MB it increases 
to 0.9995. This means that the NCD metric loses resolution as the object size 
heterogeneity increases. 
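
The effect can be reproduced on synthetic data. In the following sketch the small object is an exact prefix of the big one, so all of its information is shared, and yet the NCD stays close to 1. The random byte strings are a stand-in for audio data and the helper names are hypothetical.

```python
import random
import zlib


def compressed_size(data):
    return len(zlib.compress(data, 9))


def ncd(x, y):
    cx, cy = compressed_size(x), compressed_size(y)
    numerator = max(compressed_size(x + y) - cx, compressed_size(y + x) - cy)
    return numerator / max(cx, cy)


rng = random.Random(0)
big = bytes(rng.getrandbits(8) for _ in range(20000))   # large toy object
small = big[:500]                                       # fully contained in big

distance = ncd(big, small)
```

The numerator of Eq. 1 is dominated by the part of the big object that the small one cannot explain, so the ratio approaches 1 as the size difference grows.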


[Plot: Average versus Maximum size of the interval] 


Fig. 1. Average NCD and variance for different subsets of samples of the same bird 
species. Each subset is taken from a single set by limiting the size of the samples 
with an upper bound. In other words, for a maximum size of 50 KB, the sizes of the 
objects used to measure the average NCD are between 0 and 50 KB. As the maximum 
size increases, the NCD gets closer to 1. This is a problem because an NCD near 1 
implies dissimilarity. Simultaneously, the variance falls to near 0. This, together 
with the average, reduces the resolution of the NCD domain and thereby the 
identification capabilities of this method. 

As we mentioned in the introduction, another issue related to this database 
arises when the bird song occupies only a small percentage of the audio sample. 
From Eq. 1, one can observe that as long as the compression algorithm 
correctly identifies the information shared between the objects x and y, the 
size of C(xy) should be smaller than the size of max{C(x),C(y)}. In our case, 
one would assume that the NCD will be low only when x and y belong to the same 
bird species. However, the noise can also be used to reduce the size of C(xy). 
For example, two audios can have the same background noise (other bird songs, 
microphone artifacts, crickets, etc.) and belong to different species. 


[Figure panels: Original bird songs (before our method); Reassembled bird songs (after our method)] 


Fig. 2. Hierarchical clustering of the NCD matrices of the Great Horned Owl and 
the Common Redshank (the blue nodes correspond to the Common Redshank). The 
left panel shows the dendrogram obtained from the original samples (pre-processed to 
the same format), while the right panel shows the dendrogram obtained when applying 
our segmentation methodology, described in Sect. 3. The objects of the left 
dendrogram have different sizes, while in the right one all objects have the same 
size. The Silhouette Coefficients of the two panels are 0.081 and 0.291, 
respectively. Note that the right dendrogram corresponds to the highest point of 
Fig. 3. (Color figure online) 


Hence, in these cases the NCD will not be enough to identify each species due 
to the loss of resolution. In Fig. 2, we can observe how the great heterogeneity in 
the size of the objects affects the final clustering for an example subset of 
audio samples. The left panel shows the dendrogram obtained from the original 
objects, while the right panel shows the dendrogram obtained when applying 
our segmentation methodology. 


3 Materials and Methods 


We have used the audio data from the online database xeno-canto [13], which 
has a great heterogeneity among its audio files. This increases the difficulty of 
the problem but also makes it very interesting from the point of view of a 
parameter-free identification technique. The clustering process has been performed 
by an MQTC-based hierarchical clustering algorithm from the CompLearn software [4]. 

As described in Sect. 2, the size variety and the significant percentage of noise 
(compared to the relevant information) in the objects could harm the identification 
process considerably. For these reasons, we propose a simple methodology to use 
the NCD, as a parameter-free technique, among sets of audio samples with large 
heterogeneity without further considerations. 

In our previous work [15], we successfully identified different bird song 
samples between pairs of species, using similar-sized objects. In this case, we 
address the problem of size differences between objects and their noise, as an 
amplified version of the problem presented in [15]. Our hypothesis is that 
finding the relevant segments among the audio samples is equivalent to solving 
the loss of resolution mentioned in Sect. 2.1. 

In order to determine the relevant information of the audio samples, we have 
performed a segmentation process. Initially, we have parsed each sample to a 
standard format (for more details see Sect. 4), removing any metadata in the 
process. Next, we have split each audio sample into fragments 1.2 s long (as 
a first approach) and we have measured the NCD between each pair of them. 
At this point, we have a row of distances between each fragment and all 
the other fragments, which can be seen as a description of the information 
contained in a fragment. 
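
The segmentation step can be sketched as follows. Cutting the byte stream at fixed offsets is an assumption made for illustration (the text does not say how the 1.2 s fragments were produced), as is the conversion of 1.2 s into bytes, which uses the 56 kbps constant bit rate reported in Sect. 4 and ignores frame boundaries. All function names are hypothetical.

```python
import zlib


def compressed_size(data):
    return len(zlib.compress(data, 9))


def ncd(x, y):
    cx, cy = compressed_size(x), compressed_size(y)
    numerator = max(compressed_size(x + y) - cx, compressed_size(y + x) - cy)
    return numerator / max(cx, cy)


def split_fragments(data, fragment_bytes):
    """Cut data into consecutive fixed-size fragments (a trailing rest is dropped)."""
    stop = len(data) - fragment_bytes + 1
    return [data[i:i + fragment_bytes] for i in range(0, stop, fragment_bytes)]


def ncd_rows(fragments):
    """Row i holds the NCD between fragment i and every fragment."""
    return [[ncd(a, b) for b in fragments] for a in fragments]


# 1.2 s at 56,000 bit/s: 7,000 bytes per second times 12/10 (integer arithmetic).
FRAGMENT_BYTES = 56_000 // 8 * 12 // 10
```

For a recording held in a bytes object named audio (a hypothetical variable), `ncd_rows(split_fragments(audio, FRAGMENT_BYTES))` gives one row of distances per fragment; the matrix is symmetric because Eq. 1 takes the maximum over both concatenation orders.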

Once the audios have been segmented into fragments, we have examined 
the NCD distribution (between each fragment and the rest of the fragments) in 
order to sort and select the most relevant fragments. One fragment could include 
bird songs of one out of two species, or noise. Hence, one could assume that 
among the distances from a fragment to all the other fragments, those that 
belong to fragments of the same class should be smaller than the ones that 
belong to the other class. According to this assumption, we examined the NCD 
distribution of each fragment with respect to all the other fragments, searching 
for an (at least) bimodal distribution. In this manner, one mode should belong 
to those fragments similar to the one used as reference (same bird species), 
while the other mode(s) will correspond to the distances of the less similar 
fragments (noise, other species, etc.). 

The study of the distance distribution of each fragment revealed that some 
fragments follow our previous assumption (multimodality) better than others. 
In this work, we have used this fact to score the relevant information of each 
fragment and to discriminate the fragments by their noise. Hence, we have assumed 
that the fragments with a bigger distance between their modes are of better 
quality than the fragments with a smaller distance between their modes 
(in terms of species discrimination). Next, we have taken the distances between 
each pair of modes, for each fragment, and sorted the fragments accordingly. 
Finally, we have reconstructed the original audios by taking the n best fragments 
of each audio sample (according to their mode distances). 
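
The scoring and selection step can be sketched as follows. The text does not specify how the modes are estimated, so the sketch swaps in a simple stand-in: the distances of each fragment are split into two groups by the best one-dimensional two-cluster partition, and the gap between the two group means is used as the mode distance. The function names, the stand-in estimator and the choice to keep the selected fragments in their original order are all assumptions.

```python
import numpy as np


def mode_gap(distances):
    """Gap between the means of the best two-group split of the distances."""
    d = np.sort(np.asarray(distances, dtype=float))
    best_gap, best_sse = 0.0, np.inf
    for k in range(1, len(d)):
        low, high = d[:k], d[k:]
        sse = ((low - low.mean()) ** 2).sum() + ((high - high.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse, best_gap = sse, high.mean() - low.mean()
    return float(best_gap)


def select_fragments(ncd_matrix, n):
    """Indexes of the n fragments with the largest mode gap, in original order."""
    matrix = np.asarray(ncd_matrix, dtype=float)
    # The distance of a fragment to itself is left out of its distribution.
    scores = [mode_gap(np.delete(row, i)) for i, row in enumerate(matrix)]
    best = np.argsort(scores)[::-1][:n]
    return sorted(int(i) for i in best)


toy_matrix = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.9],
    [0.9, 0.9, 0.9, 0.0],
]
selected = select_fragments(toy_matrix, n=2)
```

In the toy matrix, the first two fragments have clearly separated groups of distances (0.1 versus 0.9) and are selected, while the last two are equally far from everything, which is the pattern expected from noise.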

From a technical point of view, we have used the zlib compression
algorithm to calculate the NCD. This algorithm, together with the software
used to calculate the dendrograms, is provided by the CompLearn Toolkit [4]. The audio
samples have been processed with the ffmpeg and lame Linux packages, and optimized
using GNU Parallel [16]. In Fig. 2, we show the dendrograms produced by the
clustering of the original objects and the clustering obtained with our method
from the fragments of the initial objects. In this figure, one can see that our
method improves the clustering quality from a slight separability (SC = 0.081)
to an almost classified clustering (SC = 0.291). It is important to point out that
the low clustering quality reported in the first case is due to the large variety of
sizes among the audio samples.
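For illustration, the NCD between two byte strings can be approximated directly with Python's standard `zlib` module. The sketch below uses the usual definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. It is not the CompLearn implementation; the function names and the compression level are assumptions made here.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (level 9 is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between x and y."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Because real compressors are imperfect, the value for two identical inputs is small but usually not exactly zero, and values slightly above 1 can occur for unrelated inputs.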


4 Experimental Results 


For our experiments, we have taken two different species: The Great Horned Owl 
and the Common Redshank, from the xeno-canto database. Taking into account 
that our purpose in this work is to deal with the variety of size and noise, we have 
limited the audio samples to specific types of bird songs for each experiment. 


[Figure 3: Silhouette Coefficient plotted against the number of fragments per object (x axis: 5 to 25). Legend: solid line, reassembled objects; dashed line, unfragmented objects.]


Fig. 3. Silhouette Coefficient for different objects’ configurations over the number of
fragments contained in each object. Each object was reconstructed from a subset of its
fragments sorted by their distribution modes distance. The black dot corresponds to
the configuration of the right dendrogram of Fig. 2. The dashed black line corresponds
to the SC value of the left dendrogram depicted in Fig. 2, before applying our method.


For each experiment, we have used the audios labeled as “song” in xeno-canto,
with no other type of bird call involved in the audio (for the selected
species, the background bird songs can differ). The number of files has been
reduced to only 14 per species. We have selected the audios randomly, balanced
equally between species. In terms of format, we have used the raw .mp3 format,
removing the metadata and normalizing the audio properties of each file.
The configuration used for each audio was a bit rate of 56 kbps, a sampling
frequency of 22.05 kHz, and an 8-bit width at a constant bit rate. We have
calculated the Silhouette Coefficient [14] for each experiment in order to
measure the separability of the clusters through their clustering quality.
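As a reminder of what the SC measures, the following sketch computes it from a precomputed distance matrix (for example, a matrix of NCD values) and one species label per object. For object i, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to another cluster, and the score is (b - a) / max(a, b). The function name and the convention of scoring singleton clusters as 0 are assumptions of this sketch, which also expects at least two clusters and distances that are not all zero.

```python
def silhouette_coefficient(dist, labels):
    """Mean silhouette over all objects, given a square distance matrix."""
    n = len(labels)
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(same) / len(same)
        # Smallest mean distance from object i to the members of another cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / n
```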

As described in Sect. 2.1, we have reassembled each object from its fragments
according to their mode distances. In addition, for each experiment, the number
of fragments selected (x axis of Fig. 3) has been the same for each object
(repeating fragments if necessary) in order to reduce the problems introduced
by the size differences.
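The reassembly step can be sketched as below. The text only states that fragments are repeated if necessary; cycling through the ranked fragments, best first, is an assumption of this sketch, as are the function name and the representation of fragments as byte strings.

```python
from itertools import cycle, islice


def reassemble(fragments, scores, n):
    """Concatenate the n best-scored fragments, repeating them if fewer exist."""
    ranked = [
        frag
        for _, frag in sorted(zip(scores, fragments), key=lambda t: t[0], reverse=True)
    ]
    # cycle() restarts from the best fragment when the object has fewer than n.
    return b"".join(islice(cycle(ranked), n))
```

With this choice every reassembled object contains exactly n fragments, which is what removes most of the size differences between objects.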
In our experiments, we have obtained good results compared with the SC
obtained from the original objects. As Fig. 3 shows, the method improves on
the base SC when using 2-4 fragments per object. The SC, however, tends to
fall for larger configurations (more fragments per object), with some noise in
the final configurations.

Finally, in order to test the capabilities of our method for a more hetero- 
geneous set of objects, we have performed an experiment over a bigger subset 
of audio samples. In this case, we have used 80 audio samples of the same two 
species, using 6 fragments to reassemble each object. The results obtained from 
this experiment (Fig. 4) show the impact of our method, improving the SC from 
0.08 to 0.22. 


[Figure 4: two dendrograms. Left: original bird songs (before our method). Right: reassembled bird songs (after our method).]


Fig. 4. Hierarchical clustering from the distance matrix of 80 audio samples and of
their fragmented versions, respectively. The blue nodes correspond to the Common
Redshank species, while the black ones belong to the Great Horned Owl species. The left
dendrogram is computed from the original distance matrix. The right dendrogram is
obtained by reassembling the audios from the fragments of the original objects.
In this case, the best 2 fragments were selected from each object (according to their
statistical mode distance). The Silhouette Coefficients for these dendrograms are 0.08
and 0.22, respectively. (Color figure online)


5 Conclusions 


In this work, we have proposed a novel method to use the normalized compression
distance in datasets with great size heterogeneity and a high percentage of noise
inside objects. We have tested this methodology to improve the identification of
bird species in the xeno-canto database. Throughout this study, we have used an
automatic segmentation selection methodology to extract the relevant information
and prevent the loss of resolution, maintaining the parameter-free nature of
the NCD.

Our hypothesis in this work is that finding the relevant intervals inside the
audio samples and reassembling them into new equal-sized objects is equivalent
to solving the loss of resolution caused by the high percentage of noise and the
variety of object sizes. Hence, we have aimed to locate the best fragments
(according to their clustering) by segmenting each object into fragments. First,
we have measured the quality of each fragment using the distance between the
modes of its NCD distribution. Then, we have selected the n best fragments in
order to reassemble the audio. Finally, we have measured the NCD between these
reassembled objects as a more accurate representation of the distances between
the original audios.

The results presented in this paper show that applying our methodology
significantly improves clustering performance compared with clustering without
it. As an example, Figs. 2 and 4 show the differences between the clustering of
the unprocessed objects and that of the reassembled objects. With this approach,
reasonable separability among species can be achieved without preprocessing the
data. Likewise, the proposed method performs a successful blind analysis without
any consideration of the size or noise of the samples.

As future work, we intend to test this new methodology on different data
formats, such as wav and flac, and as a complement to existing classification
methods. In addition, we intend to explore different compression algorithms,
bird species, and audio databases in order to test the capabilities of our
methodology.
Abstract. The detection of distance-based outliers from streaming data is
critical for modern applications ranging from telecommunications to cybersecurity.
However, existing works mainly concentrate on improving the responding speed,
and none of these proposals performs well in streams with a varying data
distribution. In this paper, we propose a Fast and Robust Outlier Detection method
(FROD for short) to solve this dilemma and improve both detection performance
and processing throughput. Specifically, to adapt to the changing distribution
in data streams, we employ the Active-Inliers-Pattern, which dynamically selects
reserved objects for further outlier analysis. Moreover, an effective
micro-cluster-based data storing structure is proposed to improve the detection
efficiency, which is supported by our theoretical analysis of the complexity
bounds. We also present a potential background updating optimization approach
to hide the updating time. Experiments performed on real-world and synthetic
datasets verify our theoretical study and demonstrate that our algorithm is not
only faster than state-of-the-art methods but also achieves better detection
performance when the outlier rate fluctuates.


Keywords: Outlier detection · Cybersecurity · Data streams · Distance-based outliers


1 Introduction 


Detection of outliers in data streams [1] is an essential task in several cybersecurity 
applications. An object is considered as an outlier if it significantly deviates from the 
typical case. There are many definitions of the outlier [4]. One of the most widely used 
is based on distance [6]. The definition is provided as follows where an important 
concept neighbor is introduced first: 


Definition 1 (Neighbor). Given a distance threshold R(R > 0), a data point o is a 
neighbor of data point o' if the distance between o and o' is not greater than R. A data 
point is not considered a neighbor of itself. 


© Springer Nature Switzerland AG 2018 
V. Kurkova et al. (Eds.): ICANN 2018, LNCS 11139, pp. 626-636, 2018. 
https://doi.org/10.1007/978-3-030-01418-6_62 


FROD: Fast and Robust Distance-Based Outlier Detection 627 


Definition 2 (Distance-based Outlier). Given a dataset D and a count threshold k (k > 0),
a data object o_i is regarded as a distance-based outlier if o_i has fewer than k
neighbors in D. In general, when R is fixed, k changes with the size of D to obtain
better performance.
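Definitions 1 and 2 translate directly into a brute-force check. The sketch below is only meant to make the definitions concrete: it assumes Euclidean distance (the definitions do not fix a particular distance), and the function names are hypothetical. Its quadratic cost is what the methods discussed later try to avoid.

```python
import math


def count_neighbors(points, i, R):
    """Number of neighbors of points[i]: other points at distance <= R (Definition 1)."""
    return sum(
        1 for j, q in enumerate(points)
        if j != i and math.dist(points[i], q) <= R
    )


def distance_outliers(points, R, k):
    """Indices of the points with fewer than k neighbors (Definition 2)."""
    return [i for i in range(len(points)) if count_neighbors(points, i, R) < k]
```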

According to this definition, distance-based outliers can easily be found in
static datasets. In the data stream scenario, however, the dataset size is
potentially unbounded, so detection is performed over a fixed amount of recent
data to ensure computational efficiency. The most common approach is based on a
sliding window, which always maintains the W most recent objects. When new
objects arrive, the window slides to incorporate S new objects from the stream.
As a result, the oldest S data points are discarded from the current window.
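A minimal sketch of this window mechanism, using a bounded deque from the Python standard library, is given below. The generator name is hypothetical, and emitting the first window once W objects have arrived is an assumption of the sketch.

```python
from collections import deque


def sliding_windows(stream, W, S):
    """Yield the window contents after every S new arrivals, once it holds W objects."""
    window = deque(maxlen=W)  # appending to a full deque drops the oldest object
    for count, item in enumerate(stream, start=1):
        window.append(item)
        if count >= W and (count - W) % S == 0:
            yield list(window)
```

For example, with W = 4 and S = 2 the stream 1, 2, ..., 7 produces the windows [1, 2, 3, 4] and [3, 4, 5, 6].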

There are two goals when detecting distance-based outliers in data streams: (i) the
accuracy of the data labeling, and (ii) the responding speed of labeling the data
objects. Unfortunately, these two goals conflict most of the time. For a data
stream whose distribution changes dynamically, the window needs to be large
enough to resist the influence of the dynamic change of the data. However, if
the window is too large, the responding time increases greatly, and the method
fails to satisfy real-time requirements.

Fig. 1. The working scheme of the algorithms based on sliding-window


As mentioned before, the current distance-based outlier detection methods for data
streams are based on a sliding window. These methods assume that the data distribution
in the current window is similar to the global distribution, so they regard the outliers in
the current window as global outliers. Hence, they are prone to misjudgments when the
data distribution changes significantly, for example during a large-scale outbreak of
outliers. A DDoS attack, for instance, floods the targeted machine with a massive
number of accesses in a short time.

Figure 1 shows an example of what happens in a real network. Assume that on the
given data set D, distance-based outlier detection performs well while the
proportion of abnormal data in D is p. When the outliers arrive in a burst
(for example, y outliers occur continuously), the classic methods based on the
sliding window need to maintain a window of size A to ensure good performance,
which increases the responding time dramatically.

In the sliding-window model, the temporal cost is mainly concentrated in
updating the model (the structure of the window) [8] at each slide. In the original
model without any optimization, the time complexity of each slide is O(W^2), where
W is the size of the window. Many methods have been proposed to reduce the time
complexity of this step, for example by reusing previously computed information to
simplify operations [2, 9] or by designing special storage structures [3, 7].

We summarize the time complexity of some representative algorithms for each
slide of the window and provide a side-by-side comparison in Table 1.


Table 1. The temporal cost of some representative algorithms.


Algorithm        | Time complexity
Exact-storm [2]  | O(W log k)
AbstractC [9]    | O(W^2 / S)
DUE [7]          | O(W log W)
Thresh LEAP [3]  | O(W^2 log S / S)
MCOD [7]         | O((1 - c)W log((1 - c)W) + kW log k)


Although these algorithms increase the responding speed, they still do not solve the
underlying problem of adapting dynamically to the data distribution.

To solve this problem, we develop in this paper a novel method that adapts to
dynamic changes of the data distribution and removes the limitations of previously
proposed algorithms. Our primary concerns are robustness to outlier outbreaks and
the improvement of detection efficiency and accuracy. In summary, the major
contributions of this work are as follows:


1. We propose a Fast and Robust Outlier Detection (FROD) algorithm based on the
Active-Inliers-Pattern, which adapts to dynamic changes of the data distribution
without storing a large amount of data (a large window). This considerably
improves the responding speed while preserving detection accuracy.

2. We adopt an effective structure based on micro-clusters to maintain the
Active-Inliers-Pattern and give the corresponding updating strategies. On real and
synthetic datasets, the resulting algorithm performs better than the
state-of-the-art techniques.

3. We present theoretical bounds that support its superiority and describe a
possible optimization approach.


The remainder of this paper is organized as follows. We present our methods in
Sect. 2, and Sect. 3 contains the performance evaluation results based on real-life
and synthetic data sets. Finally, Sect. 4 concludes the work and briefly discusses future
work in the area.


2 Methods 


Figure 2 shows the framework of FROD. When objects in the data stream arrive
continuously, they are first stored in a Buffer. Once the trigger condition is satisfied,
outlier detection is performed on the Buffer with the Active-Inliers-Pattern. The
Active-Inliers-Pattern consists of selected inliers, which are maintained by a
micro-cluster-based structure and dynamically updated according to the detection results.

Fig. 2. The framework of FROD


In this section, the FROD algorithm is described in detail. We start by introducing
the Active-Inliers-Pattern, which is used to detect outliers. Then an efficient structure
based on micro-clusters is proposed to maintain the Active-Inliers-Pattern. After that,
we describe the complete workflow of FROD and provide an optimization approach to
accelerate it. Finally, we show theoretically that FROD is more efficient than other
detection methods for data streams.


2.1 AIP for Outlier Detection 


We employ the Active-Inliers-Pattern (AIP), which plays the role of the window in
sliding-window methods but stores only selected inliers. We then create a Buffer of
fixed size S, where S is smaller than the size of the AIP. Each newly arriving data
object is stored in the Buffer until the Buffer is full.

When the Buffer is full, we mark each object stored in the Buffer as an inlier or an
outlier according to whether it has at least k neighbors in the AIP (an efficient
implementation of this process is described in Sect. 2.3). An UpdateList keeps all the
objects newly marked as inliers, while objects considered to be outliers are
directly removed.

Fig. 3. The working scheme of FROD

After traversing all the objects in the Buffer, we count the inliers in the
UpdateList and replace the same number of the oldest points stored in the AIP
with the objects in the UpdateList. The storing structure based on micro-clusters
is adjusted accordingly. The pseudocode is given in Algorithm 1.


Algorithm 1: AIP-Update
Input:  An old Active-Inliers-Pattern AIP_0
        An UpdateList l containing the newly arrived inliers
Output: A new Active-Inliers-Pattern AIP_1
AIP_1 = AIP_0
c = length(l)    /* total number of items in UpdateList l */
dataList = Sort(AIP_0.members)    /* rank members in increasing order of their arrival time */
for i := 1...c do
    Remove dataList[i-1] from dataList
    Add l[i-1] to dataList
    AIP_1.members ← dataList
end
for d ∈ AIP_1.members do
    if CheckForAdjust(d) is true then
        AdjustMicroCluster(AIP_1)
    end
end
return AIP_1


Algorithm 2: SharingPoint
Input: An incomplete micro-cluster MC_i with fewer than k members
Clusters = Find_neighborMC(MC_i)
/* Find_neighborMC() finds the micro-clusters whose center is within a range R from MC_i.center */
count = |k × 1.1 - MC_i.MemberNum|
while count ≠ 0 do
    for each micro-cluster MC ∈ Clusters do
        if MC.MemberNum > k × 1.1 then
            for each data d in MC do
                if isNeighbor(d, MC_i.center) then
                    MC_i.MemberNum++
                    count--
                    Put d in MC_i
                    Remove d from MC
                end
            end
        end
    end
end


As mentioned before, traditional methods based on the sliding window have difficulty
adapting to dynamic changes in data streams. As shown in Fig. 1, when outliers occur
on a large scale, the window must be set large enough to prevent misjudging local
outliers as global outliers. In the AIP model, however, when large-scale outliers
erupt (Fig. 3), the outliers are first separated into Buffers of size S. Since
only the data judged as inliers are saved, the AIP is not “contaminated” by
outliers. Therefore, a fixed-size AIP can robustly cope with the outlier outbreak
scenario and label the data quickly and accurately. This also applies to other
scenarios where the data distribution changes.
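The following sketch illustrates why a burst of outliers does not contaminate the AIP. It is a simplified illustration rather than the FROD implementation: neighbors are counted by brute force with Euclidean distance instead of the micro-cluster structure of Sect. 2.2, the AIP is a plain list ordered from oldest to newest, and `process_buffer` is a hypothetical name. It assumes, as in the text, that the Buffer is smaller than the AIP.

```python
import math


def process_buffer(aip, buffer, R, k):
    """Label buffered points against the AIP and return (new_aip, outliers)."""
    update_list, outliers = [], []
    for p in buffer:
        neighbors = sum(1 for q in aip if math.dist(p, q) <= R)
        (update_list if neighbors >= k else outliers).append(p)
    # Replace the oldest AIP members (front of the list) with the new inliers,
    # so the AIP keeps its size and never stores a point judged an outlier.
    new_aip = aip[len(update_list):] + update_list
    return new_aip, outliers
```

When every point of a Buffer is an outlier, the UpdateList is empty and the AIP is returned unchanged, however long the burst lasts.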


2.2 Micro-cluster-Based Storing Structure 


In the algorithm proposed above, we maintain a structure based on micro-clusters to
store the data in the AIP so that the model can be updated more promptly, and we design
a corresponding algorithm to evaluate range queries from each new object to all other
active inliers.

For each micro-cluster MC_i in the AIP, we set the radius to R/2, which means that any
object belonging to MC_i is within a distance of R/2 from the center of MC_i, and the
minimum size of a micro-cluster is k + 1. Each object o_i belonging to MC_i is therefore
definitely an inlier: the distance between any two objects in MC_i cannot exceed R, and
since MC_i has at least k + 1 members, o_i has at least k neighbors within
distance R. In general, an object may also have neighbors that belong to other
micro-clusters.
Fig. 4 shows the distribution of some points in the AIP. There are four
micro-clusters and some isolated data points. A different symbol is used for the
objects of each kind.




Fig. 4. The distribution of some points in AIP with k = 5. 


Objects that do not belong to any micro-cluster are stored in a structure called the
Singular Queue (SQ) and are depicted with the “star” symbol. In addition, for each
point in SQ (e.g., p_1), we keep a list containing the identifiers of the micro-clusters
whose centers are less than 3R/2 away from this point. As shown in Fig. 4, p_1 can
only find its neighbors in these micro-clusters, which can be easily proved.
Consequently, when we count the neighbors of p_1, we only need to examine the
members of these micro-clusters, which accelerates the search considerably.
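The 3R/2 bound follows from the triangle inequality. The short derivation below assumes that the distance d is a metric; the symbols q (a neighbor of p_1 that belongs to some micro-cluster) and c_j (the center of the micro-cluster MC_j containing q) are introduced here for illustration only.

```latex
\begin{align*}
d(p_1, q) &\le R
  && \text{($q$ is a neighbor of $p_1$, Definition 1)} \\
d(q, c_j) &\le \tfrac{R}{2}
  && \text{($q$ belongs to $MC_j$, whose radius is $R/2$)} \\
d(p_1, c_j) &\le d(p_1, q) + d(q, c_j) \le R + \tfrac{R}{2} = \tfrac{3R}{2}
  && \text{(triangle inequality)}
\end{align*}
```

Hence a micro-cluster whose center is farther than 3R/2 from p_1 cannot contain any neighbor of p_1 and can be skipped when counting neighbors.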

When the micro-clusters are thought of as spheres with radius R/2, they may
overlap (e.g., MC_4 and one of its neighboring micro-clusters in Fig. 4), although an
object can only belong to a single micro-cluster at any one time. For the points in the
overlapping area, we have set up a Sharing-Point mechanism so that they can be
dynamically reassigned. For example, when a point p is eliminated and the number of
members in its micro-cluster drops below 5, we do not dissolve this micro-cluster
immediately, as other algorithms usually do. Instead, we mark it as an incomplete
micro-cluster. After all old points have been eliminated, we check whether these
incomplete micro-clusters can “share” points from other micro-clusters to become
normal micro-clusters. The pseudocode of these operations is given in Algorithm 2.
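A possible Python rendering of the Sharing-Point idea is sketched below. It deviates from Algorithm 2 in ways that are assumptions of this sketch: it makes a single bounded pass over the candidate donors instead of looping until the count reaches zero (so it always terminates), it only moves points that lie within R/2 of the incomplete cluster's center (so the radius constraint of Sect. 2.2 keeps holding), and it stops taking points from a donor once the donor is down to about 1.1k members. The `MicroCluster` class and the function name are hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    center: tuple
    members: list = field(default_factory=list)


def share_points(incomplete, candidates, R, k):
    """Move points from well-filled nearby clusters into `incomplete`.

    Returns True if the incomplete cluster reached the target size.
    """
    threshold = k * 1.1
    needed = math.ceil(threshold) - len(incomplete.members)
    for donor in candidates:
        if needed <= 0:
            break
        # Only clusters whose center lies within R can share points.
        if math.dist(donor.center, incomplete.center) > R:
            continue
        for p in list(donor.members):
            if needed <= 0 or len(donor.members) <= threshold:
                break
            if math.dist(p, incomplete.center) <= R / 2:
                donor.members.remove(p)
                incomplete.members.append(p)
                needed -= 1
    return needed <= 0
```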


2.3 Workflow of FROD 


The primary rationale behind our approach is to drastically reduce the number of 
micro-clusters that need to be reorganized when the AIP is updated and the complexity 
in finding neighbors for each data point when performing outlier detection. The 
detailed steps of the FROD algorithm are as follows: 


Step 1: When data arrive consecutively, we first determine whether an AIP
already exists. If not, the initialization operation is performed. Otherwise,
proceed to the next step.
Step 2: The newly arrived data are added to the Buffer, and outlier detection is
performed on the data in the Buffer:

(2a) For each object p in the Buffer, find the micro-cluster MC_i nearest to p and
calculate the distance from p to the center of MC_i. If the distance is less than
R/2, put p into UpdateList. Otherwise, find the neighbors of p in SQ;

(2b-i) If p has no fewer than k neighbors in SQ, put p into UpdateList. Otherwise,
find the micro-clusters whose center is less than 3R/2 away from p and look
for neighbors in these micro-clusters;

(2b-ii) If p has at least k neighbors in total, put p into UpdateList. Otherwise, p is
added to OutlierList. After all objects have been processed, the Buffer is cleared.

Step 3: The AIP is updated with the UpdateList generated in the previous step:
(3a) Calculate the length l of UpdateList, find the l oldest points in the AIP, and
remove them from the AIP. If a removed point belonged to a micro-cluster,
decrement the number of members of this micro-cluster by one. Similarly, the
points in UpdateList are added to the AIP. If a newly added point belongs to a
micro-cluster, the number of members of that micro-cluster is increased by
one.

(3b) If there are incomplete micro-clusters in the AIP, check for each of them
whether it can share some neighbors from its neighbor clusters to become
a normal cluster, and reconstruct the micro-clusters in the AIP based on the result.
(3c) If there are still incomplete micro-clusters, these clusters are dismantled,
and their points are added to the SQ.

Step 4: Report the outliers in OutlierList, then return to Step 2. If outliers
reported by the algorithm are manually determined to be inliers [5], these points
are still added to the UpdateList, so that inliers of a kind that has never
appeared before are not repeatedly reported as outliers.
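The labeling logic of Step 2 can be sketched as follows. This is a simplified illustration with hypothetical names: micro-clusters are represented as (center, members) pairs, the SQ as a list of points, and the distance is assumed to be Euclidean. The threshold "at least k neighbors" follows Definition 2, and the 3R/2 test uses `<=` so that the boundary case is not skipped.

```python
import math


def label_point(p, clusters, sq, R, k):
    """Return 'inlier' or 'outlier' for p.

    clusters: list of (center, members) pairs; sq: list of singular points.
    """
    # (2a) Inside the nearest micro-cluster: inlier without counting neighbors.
    if clusters:
        nearest = min(math.dist(p, center) for center, _ in clusters)
        if nearest < R / 2:
            return "inlier"
    # (2b-i) Count neighbors among the points of the Singular Queue.
    neighbors = sum(1 for q in sq if math.dist(p, q) <= R)
    if neighbors >= k:
        return "inlier"
    # (2b-ii) Only micro-clusters whose center is within 3R/2 can hold neighbors.
    for center, members in clusters:
        if math.dist(p, center) <= 3 * R / 2:
            neighbors += sum(1 for q in members if math.dist(p, q) <= R)
    return "inlier" if neighbors >= k else "outlier"
```

The early returns in (2a) and (2b-i) are what save work: most inliers are labeled without scanning the members of any micro-cluster.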


2.4 Optimization and Analysis 


In this section, we present an approach to optimize our algorithm, and then we analyze
the temporal cost of FROD, comparing it with other popular outlier
detection algorithms to illustrate its superiority.


Optimization. Since normal data in FROD are similar to each other, we need not update
the AIP each time the Buffer is cleared. Instead, the AIP update process (see Step 3)
can be executed in the background. When outlier detection is performed, we first
check whether the update module has generated a new AIP. If so, the new AIP is used
for detection, and the update module is triggered again with the newly generated
UpdateList. Otherwise, the original AIP is kept for detection.
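One way to realize this background update is sketched below with a single worker thread from the Python standard library. The class, its method names, and the `update_fn` callback (which takes the current AIP and an UpdateList and returns a new AIP) are all hypothetical; the sketch also assumes that inliers arriving while an update is running are accumulated and handed to the next update.

```python
from concurrent.futures import ThreadPoolExecutor


class BackgroundUpdater:
    """Runs AIP updates in a worker thread while detection keeps using the old AIP."""

    def __init__(self, aip, update_fn):
        self.aip = aip                  # AIP currently used for detection
        self.update_fn = update_fn      # hypothetical: (aip, update_list) -> new aip
        self.pool = ThreadPoolExecutor(max_workers=1)
        self.pending = None             # Future of the running update, if any
        self.backlog = []               # inliers waiting for the next update

    def submit(self, update_list):
        """Queue new inliers and start an update if none is running."""
        self.backlog.extend(update_list)
        self._start_if_idle()

    def aip_for_detection(self):
        """Swap in a finished update, if there is one, and return the AIP to use."""
        if self.pending is not None and self.pending.done():
            self.aip = self.pending.result()
            self.pending = None
            self._start_if_idle()       # trigger again with the accumulated UpdateList
        return self.aip

    def _start_if_idle(self):
        if self.pending is None and self.backlog:
            batch, self.backlog = self.backlog, []
            self.pending = self.pool.submit(self.update_fn, list(self.aip), batch)
```

Detection never blocks on the update: `aip_for_detection` only checks whether the worker has finished and otherwise returns the AIP that is already available.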


Analysis. Let 0 < c < 1 denote the fraction of the window stored in micro-clusters.
Then the number of data points in SQ is (1 - c)W, and the number of micro-clusters is
approximately log(W/k). Because the model update module can be executed in
the background, the temporal cost is mainly concentrated in Step 2, with a time
complexity of O(S(log(W/k) + (1 - c)W)). In practical experiments, the value of c
proves to be very large (close to 1), so the complexity can be approximated as
O(S log(W/k)), which outperforms other popular algorithms.

Compared with the MCOD method proposed in [7], which is also based on the
concept of micro-clusters, FROD not only adapts to the dynamic change of the data
distribution in data streams but also improves the response speed significantly.

In the outlier detection phase, the time complexity of MCOD is
O((1 - c)W log((1 - c)W) + kW log k), which is much larger than that of FROD.
Furthermore, FROD uses an efficient Sharing-Point mechanism that allows a
micro-cluster to be incomplete temporarily, which helps the AIP update quickly.
MCOD, in contrast, eliminates clusters whose number of members falls below k as
soon as points are removed, which greatly increases the number of operations
required for updating, especially when S is large.


3 Experiments 


3.1 Experimental Methodology 


To evaluate the performance of the proposed algorithm, we compare FROD with four
typical distance-based outlier detection methods in terms of both responding time and
detection accuracy: MCOD [7], Abstract-C [9], LUE [7], and ExactStorm [2]. Our
experiments were conducted on a macOS Sierra machine with a 2.2 GHz processor and
15 GB of Java heap space.


Datasets. We chose the following datasets for our evaluation. FC (Forest Cover),
available from the UCI KDD Archive and also used in [2], contains 581,012
records with 55 attributes. TAO, available from the Tropical Atmosphere Ocean project,
contains 575,648 records with 3 attributes. STOCK, available from the UPenn Wharton
Research Data Services and also used in [3], contains 1,048,575 transaction
records with 1 attribute. Gauss is synthetically generated by mixing three Gaussian
distributions and a random noise distribution; it contains 1 million records with 1
attribute.


Default Parameter Settings. There are four parameters to be determined: the size W of
the sliding window or Active-Inliers-Pattern, the size S of the slide or Buffer, the
distance threshold R, and the neighbor count threshold k.

W is the key parameter influencing the responding speed, since it determines the
volume of data processed, and we set it to 1k by default. For fairness of measurement,
we set S to 5% of W for all datasets. In general, when R is fixed, k should change with W
to ensure that the data distribution in the current window is approximately equivalent
to the global one. In our experiments, we maintain the overall outlier rate at 1% for
all datasets, and the default value of R is set to 525 for FC, 1.9 for TAO, 0.45 for
Stock, and 0.028 for Gauss according to [8]. Based on this, the default value of k is
set to W/200 for all datasets to ensure accuracy.
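As a worked example of these settings, the derived parameters can be computed as below. The function and dictionary names are hypothetical; the numeric values are the defaults stated above.

```python
# Default distance thresholds R per dataset, as listed in the text.
DEFAULT_R = {"FC": 525, "TAO": 1.9, "Stock": 0.45, "Gauss": 0.028}


def default_parameters(W=1000, slide_fraction=0.05, k_divisor=200):
    """Derive S and k from W: S is 5% of W and k is W/200 by default."""
    S = round(W * slide_fraction)
    k = W // k_divisor
    return {"W": W, "S": S, "k": k}
```

With the default W = 1k this gives S = 50 and k = 5, and with W = 20k it gives S = 1000 and k = 100.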
Responding Time. In this part, we measure the responding time of data labeling. To
present the results clearly, we compare the total responding time for every 100k
objects with W varying in the range [1k, 20k].

As shown in Fig. 5, when W increases, the CPU time of each algorithm increases
as well, yet FROD consistently uses the least CPU time and exhibits the slowest
increase in CPU consumption on all datasets. It is about 4 to 6 times faster on FC,
and up to 1-2 orders of magnitude faster than the state of the art on TAO and Stock.

Moreover, MCOD, which is also based on a micro-cluster structure, shows good
performance in most cases. However, when W is larger than 5k on FC, its responding
time grows rapidly. This is mainly because, as W increases on FC, more and more
outliers that do not participate in any micro-cluster are added to the window, and
many additional reorganizations are performed. Both effects greatly increase the
processing time, whereas FROD excludes outliers from the AIP and updates the AIP
efficiently with the “Sharing-Point” mechanism.


[Figure 5: CPU time (s) plotted against the window size W (1k to 20k) for FROD, MCOD, LUE, Abstract-C, and Exact-Storm. Panels: (a) FC, (b) TAO, (c) STOCK.]


Fig. 5. The comparative results of different algorithms running on three datasets


Accuracy. In this section, we analyze the robustness of the methods when facing
dynamic changes of the data distribution. We first label all the data according to their
distribution in the entire Gauss data set and then calculate the frequency of outlier
occurrences in units of 1K. We have selected a piece of data with abnormal fluctuations
to compare the performance of each algorithm under different data distributions. We
use the F1-score [10] to measure the accuracy of outlier detection.
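For reference, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A minimal sketch for sets of outlier identifiers is given below; the function name and the convention of returning 0 when there are no true positives are assumptions of the sketch.

```python
def f1_score(true_outliers, predicted_outliers):
    """F1-score of the predicted outlier set against the ground-truth set."""
    truth, predicted = set(true_outliers), set(predicted_outliers)
    tp = len(truth & predicted)          # correctly reported outliers
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(truth)
    return 2 * precision * recall / (precision + recall)
```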

As shown in Fig. 6, FROD maintains a high detection performance when the
distribution of data in the stream fluctuates. In contrast, although the other
algorithms perform well when the outlier rate fluctuates little, their accuracy
degrades significantly in the presence of outlier outbreaks because the windows they
maintain are contaminated by the flocking outliers. As a result, they may misjudge
outliers and inliers, while FROD, which only retains normal points, does not have this
problem.

Fig. 6. The comparative results of different algorithms running on three datasets


4 Conclusion 


Outlier detection for extracting abnormal phenomena from dynamic streaming data is a
crucial yet difficult task. In this paper, we study the problem of continuous outlier
detection over data streams by using an Active-Inliers-Pattern. We employ the
Active-Inliers-Pattern to adapt to distribution changes in data streams and thus ensure
accuracy, and we improve the responding speed with an effective micro-cluster-based
structure for storing objects. Our experimental evaluation with both real and
synthetic datasets shows that our approach performs well even when the distribution
of the data changes dynamically, and that it is also faster than the state-of-the-art
methods.

For future work, a meaningful direction is to design a distributed algorithm to
implement the model update phase of FROD, aiming at a significant improvement in
efficiency while preserving detection accuracy.


Acknowledgement. The authors would like to thank the anonymous reviewers for their
valuable comments. This work was supported by the National Key Research and Development
Program (Grant No. 2016YFB1000101), the National Natural Science Foundation of China
(Grant No. 61379052), the Natural Science Foundation for Distinguished Young Scholars of
Hunan Province (Grant No. 14JJ1026), and the Specialized Research Fund for the Doctoral
Program of Higher Education (Grant No. 20124307110015).


References 


1. Aggarwal, C.C.: Outlier Analysis. Data Mining, pp. 237-263. Springer, Cham (2015).
https://doi.org/10.1007/978-3-319-14142-8_8

2. Angiulli, F., Fassetti, F.: Detecting distance-based outliers in streams of data. In: Proceedings
of the Sixteenth ACM Conference on Information and Knowledge Management, pp. 811-820.
ACM (2007)

3. Cao, L., Yang, D., Wang, Q., Yu, Y., Wang, J., Rundensteiner, E.A.: Scalable distance-based
outlier detection over high-volume data streams. In: Data Engineering (ICDE), IEEE
30th International Conference on 2014, pp. 76-87. IEEE (2014)


636 Z. Li et al.


4. Huang, H., Kasiviswanathan, S.P.: Streaming anomaly detection using randomized matrix
sketching. Proc. VLDB Endowment 9(3), 192-203 (2015)

5. Kalyan, V., Ignacio, A., Alfredo, C.: AI2: training a big data machine to defend. In: IEEE
International Conference on Big Data Security, New York (2016)

6. Knox, E.M.: Algorithms for mining distance based outliers in large datasets. In: Proceedings
of the International Conference on Very Large Data Bases, pp. 392-403. Citeseer (1998)

7. Kontaki, M., Gounaris, A., Papadopoulos, A.N., Tsichlas, K., Manolopoulos, Y.:
Continuous monitoring of distance-based outliers over data streams. In: Data Engineering
(ICDE), IEEE 27th International Conference on 2011, pp. 135-146. IEEE (2011)

8. Tran, L., Fan, L., Shahabi, C.: Distance-based outlier detection in data streams. Proc. VLDB
Endowment 9(12), 1089-1100 (2016)

9. Yang, D., Rundensteiner, E.A., Ward, M.O.: Neighbor-based pattern detection for windows
over streaming data. In: Proceedings of the 12th International Conference on Extending
Database Technology: Advances in Database Technology, pp. 529-540. ACM (2009)

10. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the
22nd Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pp. 42-49. ACM (1999)




Unified Framework for Joint Attribute 
Classification and Person Re-identification 


Chenxin Sun1, Na Jiang1, Lei Zhang", . Wang”, Wei Wu1, 
and Zhong Zhou1 


1 State Key Laboratory of Virtual Reality Technology and Systems, 
Beihang University, Beijing, China 
zz@buaa.edu.cn 
2 Department of Computer Science, 
Texas A&M University — Commerce, Texas, USA 


Abstract. Person re-identification (re-id) is an essential task in video surveil- 
lance. Existing approaches mainly concentrate on extracting useful appearance 
features from deep convolutional neural networks. However, they do not use, 
or only partially use, semantic information such as attributes or person orientation. 
In this paper, we propose a novel deep neural network framework that 
greatly improves the accuracy of person re-id and also that of attribute classi- 
fication. The proposed framework includes two branches, the identity one and 
the attribute one. The identity branch employs the refined triplet loss and 
exploits local cues from different regions of the pedestrian body. The attribute 
branch has an effective attribute predictor containing hierarchical attribute loss 
functions. After training the identification and attribute classifications, pedestrian 
representations are derived which contain hierarchical attribute information. 
The experimental results on DukeMTMC-reID and Market-1501 datasets 
validate the effectiveness of the proposed framework in both person re-id and 
attribute classification. For person re-id, the Rank-1 accuracy is improved by 
7.99% and 2.76%, and the mAP is improved by 14.72% and 5.45% on 
DukeMTMC-reID and Market-1501 datasets respectively. Specifically, it yields 
90.95% in accuracy of attribute classification on DukeMTMC-reID, which 
outperforms the state-of-the-art attribute classification methods by 3.42%. 


Keywords: Deep learning - Person re-identification - Attribute classification 


1 Introduction 


Person re-identification (re-id) aims at retrieving persons from non-overlapping cameras 
or different timestamps. Recently, person re-id has been drawing increasing 
attention from both academia and industry in that it has broad applications in 
surveillance systems for efficiently preventing and tracking crimes. However, the 
effects caused by factors like viewpoint variations, occlusion and illumination condi- 
tion differences potentially make the person re-id an extremely challenging task. 

With the rise of deep learning in recent years, deep convolutional neural networks 
have been widely used in person re-id and have yielded promising performance [1, 2]. 
However, when applied to real scenarios, these methods tend to be less effective 
due to the lack of detailed cues. In [3], a person re-id model is proposed that utilizes 
different parts of the image so that it can extract regional features containing localized 
information. The feature maps of different regions of a person appear quite different, 
which makes body region alignment of great importance for person re-id. In 
our re-id framework, we use accurate keypoint locations of a person, obtained through 
keypoint detection, to extract the desired body regions. 

Another commonly used solution is to exploit person attributes, since 
attribute information may contain domain cues that serve as 
powerful complementary information in the person re-id task [4-7]. Theoretically, 
attributes often represent high-level features of a pedestrian which could easily be 
missed by approaches based on appearance features. As shown in Fig. 1, people with 
similar appearance can be easily distinguished by attribute information, which motivated 
us to study this problem. To solve it, we integrate attribute information into the 
CNN model for the re-id task using our framework. 


[Figure 1: two pedestrians with attribute labels. First: long sleeves, short hair, blue pants, female. 
Second: handbag, short sleeves, long hair, black pants, male.] 


Fig. 1. Examples of pedestrians with similar appearance but different attribute labels. The 
attribute labels (e.g., bag vs. handbag, long sleeves vs. short sleeves, etc.) serve as 
discriminative information to distinguish the pedestrians. 


The main contributions of this paper are as follows: (1) A deep neural network incorporating 
body parts and pose information is proposed. (2) A hierarchical loss guided 
structure is used to extract meaningful attribute features and consequently to combine 
the attribute representation with the appearance representation for better re-id. 
(3) Experimental results on the DukeMTMC-reID and Market-1501 datasets demonstrate the 
effectiveness of the proposed framework, which outperforms the state-of-the-art re-id 
methods in terms of mAP and Rank-1. 


2 Related Work 


Person re-identification was first introduced and studied by Zajdel et al. [8] in 2005. They 
assume that every individual is associated with unique hidden labels and design a 
dynamic Bayesian network to encode the statistical relationships between the features 
and the labels of the same identity. Typical traditional person re-identification methods 
use color or hand-crafted features as feature descriptors. Liao et al. [9] design the Local 
Maximal Occurrence Representation together with an XQDA metric learning approach 
for person re-id. 

Convolutional Neural Networks were first used for person re-id in [2, 10]. The method in [2] 
splits the input person images into three horizontal strips, which are processed by several 
convolutional layers independently. Meanwhile, there are approaches [10, 11] which solve 
the re-id problem by directly minimizing the feature distance between 
image pairs or triplets. The Siamese model proposed by Li et al. [10] takes two images 
as input and ends directly with a same person/different person classification through a 
deep neural network. Cheng et al. [11] extend this idea and design a similar framework, 
which processes three images at a time and introduces the triplet loss for metric 
learning. There are also methods which extract more efficient person features from a 
tree-structured competitive neural network [3] or different levels of neural network 
representations [1]. 

Visual semantic attributes have been investigated in several studies [4-7, 12]. Zhang 
et al. [4] compute the appearance distances and the attribute distances from two separate 
models and fuse these two distances to get the final ranking list. To train 
unified neural networks, a few methods [5-7] use identification and attribute classification 
losses at the same time to encourage the neural networks to capture both identification 
and attribute information. However, the information extracted from different 
domains is difficult to integrate using losses that aim to solve distinct problems. Su et al. 
[12] propose a weakly supervised multi-type attribute learning algorithm which only 
uses a limited number of labeled attribute data. In their work, Su et al. employ a three-stage 
fine-tuning strategy to train the model either on attribute datasets or on other datasets 
labeled only with person IDs. The work closest to this paper is [6], in which a combination 
of re-id and attribute classification losses is used to learn overall representations 
for person re-id. 


3 Proposed Approach 


We propose a novel deep neural network framework that jointly learns person re-identification 
and attribute classification, as shown in Fig. 2. Our approach includes an 
identity branch based on DenseNet-121 and an attribute branch based on ResNet-50 to 
learn identity and attribute classification respectively. In Fig. 2, the upper part of the 
framework is the identity branch while the lower part is the attribute branch. At 
inference time, given a person image as input, we combine the identity feature vector and 
the attribute feature vector extracted from the identity and attribute branches respectively 
to get the final re-id feature vector. We then rank the gallery images according to their 
feature distances to the final representation of the query image. In the following 
part, we first describe the details of the identity learning framework in Sect. 3.1 and then the 
attribute classification structure in Sect. 3.2. 


Fig. 2. Overview of our approach. Inputs are quintuples described in Sect. 3.1. 


3.1 Identity Learning Framework 


To mitigate occlusions and reduce misalignments, several person re-id studies combine 
global features with local features which are extracted from certain body parts. Com- 
pared with fixed mandatory horizontal strips, accurate body part segmentation can yield 
more representative local features and greatly eliminate the influence of background. 
Inspired by this observation, we use the PAFs model [13] to localize fourteen accurate 
body keypoints and pool three ROI (Region-of-Interest) areas, head, UpperBody and 
LowerBody, from the feature maps according to the locations of the keypoints. In each 
forward pass, four feature vectors, extracted from the main full image branch and 
three body part branches, are concatenated into one identity vector which is used for 
model training, represented by the colored rectangle in Fig. 3. The three images on the yellow 
shadow produce the triplet loss while the three images on the green shadow produce the 
orientation loss. These two losses are then added together to get the identity loss. 

In the training process, we introduce a new orientation-based triplet loss based on 
the traditional triplet loss [14] in the proposed identity learning model. Concretely, the 
traditional triplet loss is trained on triplets $\{x_i^a, x_i^p, x_j^n\}$, where $x_i^a$ and $x_i^p$ denote two 
different images of the same person $i$, while $x_j^n$ is the third image of a different person. 
The purpose of the triplet loss is to train the network to pull $x_i^a$ closer to $x_i^p$ and push 
$x_j^n$ away, as formulated below: 


$$Loss_{triplet} = \max\left(d\left(f(x_i^a), f(x_i^p)\right) - d\left(f(x_i^a), f(x_j^n)\right) + \alpha,\ 0\right) \qquad (1)$$


where $f(x)$ is the feature of the image $x$, and $d(x, y)$ represents the distance between $x$ 
and $y$. $\alpha$ represents the margin between positive pairs and negative pairs. 


Fig. 3. Identity learning network. Inputs of the convolutional neural network are quintuples 
including the original image, the positive example, the negative example and two positive 
examples with same/different orientation, represented by Anchors, P, N, Ps, Pd respectively. 
(Color figure online) 
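
As a minimal sketch of Eq. (1), the snippet below evaluates the triplet loss for a single triplet of feature vectors. The Euclidean distance and the function names are assumptions made for illustration only; the text states only that d is a distance between features.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors (assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(f_anchor, f_positive, f_negative, margin):
    """Eq. (1): hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    gap = euclidean(f_anchor, f_positive) - euclidean(f_anchor, f_negative)
    return max(gap + margin, 0.0)


# Negative already farther than the positive by more than the margin: no penalty.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [6.0, 8.0], margin=1.0))
# Negative closer than the positive: the loss is positive.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], margin=0.5))
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute to training.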


In our identity learning framework, we argue that pose information can further improve the 
performance of the triplet loss. Smaller feature distances between 
positive samples with the same orientation can be achieved with the following 
loss: 


$$Loss_{orientation} = \max\left(d\left(f(x_i^a), f(x_i^{ps})\right) - d\left(f(x_i^a), f(x_i^{pd})\right) + \beta,\ 0\right) \qquad (2)$$


where $x_i^{ps}$ represents the positive sample having the same orientation as the anchor 
sample $x_i^a$, while $x_i^{pd}$ represents the positive sample having a different orientation. $\beta$ 
represents the margin between the same orientation pairs and different orientation pairs. 
Other symbols in Eq. (2) are the same as the symbols in Eq. (1). 

As for the accurate orientation of the images, we use the orientation classification 
results from the attribute classifier. 

The overall loss function for identity learning is formulated as: 


$$Loss_{identity} = Loss_{triplet} + \omega \cdot Loss_{orientation} \qquad (3)$$


where $\omega$ is a weight balancing the two losses of different purposes. 
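
The combination of Eqs. (1) to (3) for one quintuple (anchor, positive, negative, same-orientation positive, different-orientation positive) can be sketched as follows. This is an illustrative reading, not the authors' implementation: the Euclidean distance, the function names, and the example margin and weight values are assumptions.

```python
import math


def distance(u, v):
    """Euclidean distance between feature vectors (assumed choice of d)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hinge_gap(f_anchor, f_near, f_far, margin):
    """Shared form of Eqs. (1) and (2): max(d(a, near) - d(a, far) + margin, 0)."""
    return max(distance(f_anchor, f_near) - distance(f_anchor, f_far) + margin, 0.0)


def identity_loss(f_a, f_p, f_n, f_ps, f_pd, alpha, beta, omega):
    """Eq. (3): triplet loss plus the weighted orientation loss."""
    loss_triplet = hinge_gap(f_a, f_p, f_n, alpha)          # Eq. (1)
    loss_orientation = hinge_gap(f_a, f_ps, f_pd, beta)     # Eq. (2)
    return loss_triplet + omega * loss_orientation


# Hypothetical 2-D features for one quintuple.
loss = identity_loss(
    f_a=[0.0, 0.0], f_p=[3.0, 4.0], f_n=[0.0, 1.0],
    f_ps=[0.0, 3.0], f_pd=[0.0, 1.0],
    alpha=1.0, beta=0.5, omega=0.5,
)
print(loss)
```

Both terms share the same hinge form; only the roles of the "near" and "far" samples and the margin differ.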


3.2 Attribute Classification 


Attribute classifiers are designed to effectively predict the attribute labels and provide 
meaningful feature vectors to the identity branch as complementary information. 
We adapt the training strategy to each of the two phases described below. 


Phase 1. Person attribute classification is formulated as a multitask problem, which 
requires optimizing all attribute predictors. Suppose we have $N$ training images $I_i$ 
($i = 1, \ldots, N$) labeled with $M$ attributes $Label_{ij}$ ($j = 1, \ldots, M$). We need to learn $M$ 
predictors $\phi_j(I_i)$ to minimize the difference between the expected output of the predictors 
and the labels, which can be formulated as follows: 


$$\sum_{i=1}^{N} \sum_{j=1}^{M} Loss\left(\phi_j(I_i) - Label_{ij}\right) \qquad (4)$$


where $Loss(\cdot)$ in Eq. (4) is the loss function that calculates the difference between the 
output of each predictor and the label; in our experiment, we choose the square loss as the 
loss function. 

In the process of training, we observe that attributes have different convergence 
rates and training difficulties, and that some attributes like “backpack” and “upwhite” appear 
more frequently than others. To capture such facts, we follow the approach of [14] and 
weight the attributes in the loss function: 


$$\sum_{i=1}^{N} \sum_{j=1}^{M} \lambda_j \cdot Loss\left(\phi_j(I_i) - Label_{ij}\right) \qquad (5)$$


where $\lambda_j$ is a scalar value that weights the importance of attribute $j$ in the overall loss 
function. 
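
A minimal sketch of the weighted loss in Eq. (5) with the square loss is given below. The function name and the toy predictions, labels, and weights are hypothetical; setting every weight to 1 recovers the unweighted loss of Eq. (4).

```python
def weighted_attribute_loss(predictions, labels, weights):
    """Eq. (5) with the square loss.

    predictions, labels: N rows of M values (one row per image, one column per attribute)
    weights:             M per-attribute weights (lambda_j)
    """
    total = 0.0
    for pred_row, label_row in zip(predictions, labels):
        for weight, pred, label in zip(weights, pred_row, label_row):
            total += weight * (pred - label) ** 2
    return total


# Two images, two attributes (toy values).
predictions = [[1.0, 0.0], [0.5, 1.0]]
labels = [[1.0, 1.0], [0.0, 1.0]]

print(weighted_attribute_loss(predictions, labels, [1.0, 1.0]))  # unweighted, Eq. (4)
print(weighted_attribute_loss(predictions, labels, [2.0, 0.5]))  # weighted, Eq. (5)
```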

Instead of manually tuning the hyper-parameter $\lambda_j$ using methods like cross validation, 
we propose an adaptive method to update $\lambda_j$ every $k$ iterations during training. 
In each batch, we separate the training images into two parts, the training part and the 
auxiliary part, both of which are passed through the neural network. We get two kinds of 
loss vectors from the output of the neural network. Only the loss vector obtained 
from the training part is used to update the neural network, while the loss vector 
obtained from the auxiliary part is stored in a data structure $Loss_{[\,]}$ used to update the 
weight vector $\lambda$. We formulate the weight update algorithm in Eqs. (6) and (7). 


$$\lambda = \left[\, \left[ Loss_{[n-k,n]} - Loss_{[n-2k,n-k]} \right]_{norm} \cdot \left[ Loss_{[n-k,n]} \right]_{norm} \right]_{norm} \qquad (6)$$


$$[v]_{norm} = \frac{v - v_{min}}{v_{max} - v_{min}} \qquad (7)$$


where $\lambda$ is an $M$-dim vector, $\cdot$ stands for dot product, $Loss_{[\,]}$ is a data structure storing the 
auxiliary loss vectors, $n$ is the number of losses stored in $Loss_{[\,]}$, $Loss_{[p,q]}$ stands for an 
average loss whose every element is the mean value of the corresponding elements 
from $Loss_p$ to $Loss_q$, $[\cdot]_{norm}$ is the normalization function in Eq. (7), $v_{min}$ and $v_{max}$ refer 
to the minimum and the maximum values in vector $v$ respectively, and $k$ is set to 12 
empirically in our experiment. 

In Eq. (6), the $[Loss_{[n-k,n]} - Loss_{[n-2k,n-k]}]_{norm}$ factor encourages larger weights for 
attributes whose current losses change drastically compared to previous losses, 
while the $[Loss_{[n-k,n]}]_{norm}$ factor encourages larger weights for 
attributes which have not converged. To this end, we keep training our 
attribute classification network using the weighted loss until convergence, as shown in 
Fig. 4(a). When we train the attribute classification network with our identity learning 
network, we use an adaptive strategy to assist the re-id task, discussed in Phase 2. 


Fig. 4. A figure caption is always placed below the illustration. Short captions are centered, 
while long ones are justified. The macro button chooses the correct format automatically. 
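
The adaptive weight update described for Eqs. (6) and (7) can be sketched as below. Several points are assumptions rather than statements from the text: the product of the two factors is taken element-wise (so that the result stays an M-dimensional vector), the result is normalized once more, the averaging windows cover the last k and the preceding k stored loss vectors, and a constant vector is mapped to zeros to avoid division by zero. All function names are hypothetical.

```python
def mean_loss(history, start, stop):
    """Element-wise mean of the stored loss vectors history[start:stop]."""
    window = history[start:stop]
    return [sum(column) / len(window) for column in zip(*window)]


def min_max(v):
    """Eq. (7): min-max normalization of a vector."""
    lo, hi = min(v), max(v)
    if hi == lo:
        return [0.0] * len(v)  # assumption: degenerate case mapped to zeros
    return [(x - lo) / (hi - lo) for x in v]


def update_weights(history, k):
    """Sketch of Eq. (6) over a list of auxiliary loss vectors (one per iteration)."""
    n = len(history)
    if 2 * k > n:
        raise ValueError("need at least 2*k stored loss vectors")
    recent = mean_loss(history, n - k, n)            # Loss_[n-k, n]
    previous = mean_loss(history, n - 2 * k, n - k)  # Loss_[n-2k, n-k]
    change = min_max([r - p for r, p in zip(recent, previous)])
    level = min_max(recent)
    # Assumption: element-wise product, followed by an outer normalization.
    return min_max([c * l for c, l in zip(change, level)])


# Toy history: 2 stored loss vectors for 3 attributes, k = 1.
history = [[1.0, 1.0, 1.0], [1.0, 2.0, 3.0]]
print(update_weights(history, k=1))
```

Under this literal reading, min-max normalization maps the smallest entry to 0, so at least one attribute receives zero weight after each update; a practical implementation might add a small floor.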


Phase 2. Attributes in datasets can generally be classified into two groups 
according to whether they can be assigned to certain image regions. Attributes like 
the color of upper clothes, backpack, and the color of lower clothes rely on small 
regions of images rather than the whole images. Based on this observation we design a 
multi-branch framework for efficient attribute classification which predicts region-based 
attributes in separate branches, as shown in Fig. 4(b). We initialize our deep 
neural network with the weights obtained in Phase 1 and use the locations of the ROI regions 
in Sect. 3.1 to pool three regions from the first pooling layer. 

Besides, according to the influence of attribute labels on the person re-id task, we 
choose several attribute labels to train the overall framework using Eq. (4), discarding 
the prediction layer trained in Phase 1. The selective attribute loss from the main branch 
together with three losses from region based branches constitute our hierarchical loss. 


4 Experiments 


4.1 Implementation Details 


In our experiments, we choose the DenseNet [15] model as our identity branch and ResNet 
[16] as the attribute classification branch. The identity branch includes a backbone 
network and three body part subnetworks. They share the weights from the first 
convolutional layer to the first dense block. We add an ROI pooling layer behind the 
first dense block to pool three areas from the shared feature maps according to the 
output of the PAFs keypoint estimator. The backbone network and three subnetworks all 
have four dense blocks with different growth rates. The attribute branch 
is designed similarly to the identity branch, except that it 
uses the proposed hierarchical loss as the objective function. 

In the training phase, we first use $Loss_{identity}$ in Eq. (3) to train the identity branch 
and the loss in Eq. (5) to train the attribute branch separately until they converge. Second, 
we fix the layers before the pooling1 layer in our attribute branch and copy the 
layers after the pooling1 layer to form three region-based subnetworks. Using the proposed 
hierarchical loss in Phase 2, we train the region-based subnetworks and the main 
attribute branch until convergence. Finally, we concatenate the feature vector extracted 
from the identity branch and the feature vector obtained from the main part of our 
attribute branch to get a final re-id feature vector, as shown in Fig. 2. Then we calculate 
the classification loss using this final re-id feature vector, and finetune the whole 
framework using the classification loss and the hierarchical loss until convergence. 

In the testing phase, we extract a 3048-D feature vector from the final fused layer. 
This feature vector carries not only identity discriminability but also attribute information. 
We use this 3048-D feature for person re-id. 
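
The test-time procedure (fuse the identity and attribute features, then rank gallery images by feature distance to the query) can be sketched as follows. Fusion by plain concatenation follows the text; the Euclidean distance, the function names, and the tiny 3-D toy vectors standing in for the 3048-D features are assumptions.

```python
import math


def fuse(identity_vec, attribute_vec):
    """Concatenate the identity and attribute feature vectors into one re-id vector."""
    return list(identity_vec) + list(attribute_vec)


def rank_gallery(query, gallery):
    """Return gallery indices sorted by increasing distance to the query feature."""
    return sorted(range(len(gallery)), key=lambda i: math.dist(query, gallery[i]))


# Toy features: 2-D identity part plus 1-D attribute part.
query = fuse([0.0, 0.0], [1.0])
gallery = [
    fuse([3.0, 4.0], [1.0]),  # distance 5
    fuse([0.0, 1.0], [1.0]),  # distance 1
    fuse([0.0, 0.0], [1.0]),  # distance 0
]
print(rank_gallery(query, gallery))
```

Rank-1 accuracy then checks whether the first index in the returned ranking belongs to the same identity as the query.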


4.2 Performance on Attribute Classification 


To evaluate the effect of the attribute domain learning, we conduct the attribute clas- 
sification on DukeMTMC-reID [17] and Market-1501 [18] datasets. In such a way, the 
identity and attribute labels are obtained for the designed framework. 


Table 1. Attribute recognition accuracy on DukeMTMC-reID 


Methods  | Gender | Hat   | Boots | Top   | Backpack | Handbag | Bag   | Shoes | Upcolor | Downcolor | Mean 
SVM [20] | 77.03  | 82.24 | 82.45 | 87.64 | 69.59    | 93.60   | 83.01 | 90.05 | 70.94   | 68.48     | 80.50 
APR [10] | 82.61  | 86.94 | 86.15 | 88.04 | 77.28    | 93.75   | 82.51 | 90.19 | 72.29   | 41.48     | 80.12 
Baseline | 83.12  | 81.09 | 80.52 | 89.91 | 76.05    | 90.06   | 81.08 | 81.92 | 75.54   | 70.55     | 80.98 
Ours     | 88.94  | 82.97 | 80.13 | 93.60 | 87.02    | 89.60   | 91.60 | 83.65 | 93.94   | 91.84     | 90.95 


Table 2. Attribute recognition accuracy on Market-1501 


Methods Handbag | Bag 
APR [10] 88.98 
Baseline | 81.08 | 85.39 | 70.49 | 87.47 | 84.59 | 81.51 86.22 85.18 | 67.30 | 92.10 | 71.57 71.05 80.33 
Ours 88.94 | 84.76 | 78.26 | 93.53 | 92.11 | 84.79 | 85.46 88.40 | 67.28 | 97.06 | 87.50 87.21 86.98 


Gender Clothes | Backpack Upcolor | Downcolor 


In Tables 1 and 2, we compare the attribute recognition accuracy of the proposed 
method with two state-of-the-art ones, Baseline and APR [5]. Baseline denotes the 
attribute branch trained with the loss in Eq. (4) and Ours represents the attribute classifier 
finetuned with the weighted attribute loss in Eq. (5). As shown in the tables, we achieve 
competitive results on these two datasets and the proposed framework significantly 
outperforms the baseline. It is worth noting that the results in [14] are also very 
competitive, with mean accuracies of 87.53% and 88.49% on the 
DukeMTMC-reID and Market-1501 datasets. Our framework achieves 90.95% accuracy 
on DukeMTMC-reID, outperforming all state-of-the-art methods by 3.42 percentage points. 


4.3 Performance on Person Re-Identification 


In this section, we evaluate the performance of our method on the DukeMTMC-reID 
and Market-1501 datasets. 


Table 3. Comparison with the state-of-the-art approaches. 


DukeMTMC-reID               | Rank-1 | mAP   | Market-1501                 | Rank-1 | mAP 
LOMO + XQDA [9]             | 30.8   | 17.0  | LOMO + XQDA [9]             | 43.80  | 47.78 
GAN [17]                    | 67.68  | 47.13 | GAN [17]                    | 79.33  | 55.95 
Loss Embedding [19]         | 68.90  | 49.30 | Loss Embedding [19]         | 79.51  | 59.87 
APR [6]                     | 70.69  | 51.88 | ACRN [7]                    | 83.61  | 62.60 
ACRN [7]                    | 72.58  | 51.96 | APR [6]                     | 84.29  | 64.67 
Baseline                    | 67.58  | 47.46 | Baseline                    | 72.50  | 45.23 
Baseline + Triplet          | 72.33  | 51.72 | Baseline + Triplet          | 81.32  | 61.50 
Baseline + Improved Triplet | 75.72  | 56.20 | Baseline + Improved Triplet | 85.88  | 67.28 
Ours                        | 80.57  | 66.68 | Ours                        | 87.05  | 70.12 


Table 3 compares the performance of the proposed method with that of 
several state-of-the-art methods. Baseline represents our identity network without the 
triplet loss, Baseline + Triplet represents the identity network with the original triplet loss in 
Eq. (1), Baseline + Improved Triplet represents the identity network with the proposed 
loss in Eq. (3), and Ours represents the results of our overall framework in Fig. 2. As 
shown in Table 3, relative to the best competing method, our overall framework improves 
the Rank-1 accuracy by 7.99 and 2.76 percentage points and the mAP by 14.72 and 5.45 
percentage points on the DukeMTMC-reID and Market-1501 datasets respectively. This 
result shows the effectiveness of the proposed attribute information transfer. With the use 
of the triplet loss and the proposed supplementary attribute information, we observe a 
significant improvement in the final results. 
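
The reported gains can be checked directly against Table 3 by subtracting the best competing score from the score of the overall framework. The short script below does this with the values copied from the table; the variable and function names are illustrative.

```python
# Scores of the published competing methods, copied from Table 3.
competitors = {
    "DukeMTMC-reID": {
        "Rank-1": [30.8, 67.68, 68.90, 70.69, 72.58],
        "mAP": [17.0, 47.13, 49.30, 51.88, 51.96],
    },
    "Market-1501": {
        "Rank-1": [43.80, 79.33, 79.51, 83.61, 84.29],
        "mAP": [47.78, 55.95, 59.87, 62.60, 64.67],
    },
}

# Scores of the overall framework ("Ours" row of Table 3).
ours = {
    "DukeMTMC-reID": {"Rank-1": 80.57, "mAP": 66.68},
    "Market-1501": {"Rank-1": 87.05, "mAP": 70.12},
}


def gains(ours, competitors):
    """Difference, in percentage points, between our score and the best competitor."""
    return {
        dataset: {
            metric: round(ours[dataset][metric] - max(scores), 2)
            for metric, scores in metrics.items()
        }
        for dataset, metrics in competitors.items()
    }


print(gains(ours, competitors))
```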


5 Conclusion 


In this paper, we have presented a deep convolutional neural framework employing 
hierarchical attribute information for person re-identification. With the joint learning of 
the identity and attribute supervision from the same dataset, we transfer information 
from the attribute domain to the identity domain, where it is used as supplementary 
information. According to the evaluation results, the proposed framework 
achieves highly accurate attribute classification and person re-id compared with the 
state-of-the-art methods in the field on two datasets. 


Abstract. This paper introduces a new associative approach for significant 
acceleration of k Nearest Neighbor (kNN) classifiers. The kNN classifier is a 
lazy method, i.e. it does not create a computational model, so it is inefficient 
during classification using big training data sets because it requires going 
through all training patterns when classifying each sample. In this paper, we 
propose to use Associative Graph Data Structures (AGDS) as an efficient model 
for storing training patterns and their relations, allowing for fast access to nearest 
neighbors during classification made by kNNs. Hence, the AGDS significantly 
accelerates the classification made by kNNs, especially for large and huge 
training datasets. In this paper, we introduce an Associative Acceleration 
Algorithm and demonstrate how it works on this associative structure, substantially 
reducing the number of checked patterns and quickly selecting k 
nearest neighbors for kNNs. The presented approach was successfully compared to classic 
kNN approaches. 


Keywords: Classification - K nearest neighbors - Associative acceleration - 
Brain-inspired associative approach - Associative Graph Data Structures 


1 Introduction 


Today, in computer science, we need to face the computational difficulties of Big Data 
[4, 18] and create new efficient models operating on Big Data to produce intelligent 
systems for various uses [14]. The problem of Big Data processing is not only 
about computational methods but also about the data structures which we use for 
representing data, because they significantly influence the effectiveness of algorithms 
applied to large amounts of data. Data stored in traditional data structures (usually 
tables and relational databases) are easy for humans to read and interpret. However, such 
structures do not represent many important relations [7], which must be searched for in many 
nested loops, spoiling computational complexity and the efficiency of data access [9]. We try to 
overcome a part of these inefficiencies using biologically inspired associative mechanisms 
[6, 7, 13]. We focus on the very popular k Nearest Neighbor classifier, which is easy 
to use and supplies users with satisfactory results without a big effort or time invested in 
designing and training more advanced computational intelligence models [4, 17]. 

K Nearest Neighbors (kNN) classifiers have already been widely studied, extended, and described 
in many papers, e.g. [1, 2, 5, 11, 16], where fuzzy logic, genetic algorithms, rough sets, 
various trees and other approaches were used to improve the efficiency of kNNs. An 
interesting approach presented in [20] uses a weighted voting method for kNN. In 
this approach, a neighbor which is closer to the test object is weighted more heavily. 
A similar solution based on the weighted voting approach was also shown in [21]. 
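
To illustrate both the classic brute-force kNN search (every training pattern is visited) and distance-weighted voting, a minimal sketch follows. The inverse-distance weight 1/(d + eps) is an assumption chosen for illustration; the exact weighting used in [20] and [21] is not given here, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn(train, query, k, eps=1e-9):
    """Classify `query` by distance-weighted voting among its k nearest neighbors.

    train: list of (feature_tuple, label) pairs
    The brute-force search computes the distance to every training pattern.
    """
    by_distance = sorted((math.dist(features, query), label) for features, label in train)
    votes = defaultdict(float)
    for distance, label in by_distance[:k]:
        votes[label] += 1.0 / (distance + eps)  # closer neighbors vote more heavily
    return max(votes, key=votes.get)


train = [((0.0, 0.0), "a"), ((1.0, 0.0), "b"), ((1.0, 0.1), "b")]

# With k = 3 a plain majority vote would pick "b" (two of three neighbors),
# but the single very close "a" neighbor outweighs them.
print(weighted_knn(train, (0.1, 0.0), k=3))
```

The sort over all training patterns is exactly the cost that an index over neighboring attribute values is meant to avoid.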

This paper describes Associative Graph Data Structures (AGDS) [6, 7] and a 
specially developed algorithm operating on these structures that allows finding k 
nearest neighbors very quickly, i.e. without looking through all training patterns, but 
checking only a limited subset of them using special features of the AGDS structures. 
They allow us to move to close or similar objects in constant time, so we can also 
compute the distances to the limited subset of close training patterns very quickly, 
pointing out the k nearest neighbors. This paper shows the advantages of AGDS structures and 
their use for the acceleration of kNN classifiers. The AGDS structures remove the 
inconvenience of classic tabular structures typically used by kNN classifiers [4]. The 
main contribution is the presentation of an Associative Acceleration Algorithm for 
kNN classifiers using AGDS structures and comparisons of its speed to the classic 
approaches. 


2 Associative Graph Data Structures 


Associative Graph Data Structures (AGDS) first introduced by Horzyk in [6] and 
accelerated by the use of AVB+trees in [7] were inspired by the associative processes 
that take place in real brains [12]. They are defined as graphs of nodes representing 
aggregated, counted, and sorted attribute values represented by value nodes and objects 
defined by the values and represented by object nodes. Value nodes represent unique 
attribute values and are connected to the object nodes defined by the values of the 
connected value nodes. Moreover, AGDS structures can contain additional nodes 
representing subsets or ranges of values, and various combinations of values or objects, 
defining clusters or classes, as well as various dependencies and relations between the 
nodes of such graphs, e.g. the sequence in time or the proximity or neighborhood in 
space. AGDS structures can directly represent a lot of useful features and relations 
between stored values and objects, e.g. neighborhood, similarity, proximity, order, 
defining, aggregations of the same value and objects, and numbers of aggregated value 
or objects. Therefore, they eliminate the necessity to search for various relations in 
loops, delivering results in a much faster time because such features and relations are 
always available in constant time. Thanks to the aggregation of duplicates made 
during the transformation of data stored in tables into AGDS structures, the data are often 
compressed losslessly. The possible compression factor depends on the number 
of duplicates in raw tabular data and the types of aggregated duplicates because each 
duplicate is replaced by an extra connection that also uses some memory. The compression 
is treated as a side product of this transformation, but it can have a certain 
value for Big Data collections. The aggregated, counted, and sorted values for all 
attributes simultaneously allow us to compute minima, maxima, sums, averages, and 
medians faster in AGDS structures than using tables, as described in [7, 8]. We can also 
quickly move to neighbor values for each attribute. This feature was the basis for defining a 
new associative data model for kNN classifiers which significantly limits the 
number of checked training patterns and thereby raises the efficiency of the kNN classifiers 
presented in this paper. 
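
The aggregation described above (unique, sorted value nodes per attribute, each connected to the object nodes it defines, with duplicate patterns merged) can be approximated with ordinary dictionaries. This is only a simplified sketch under stated assumptions: a real AGDS keeps direct links between neighboring value nodes and may use AVB+trees [7], none of which is modeled here, and all names are hypothetical. The sample patterns are taken from Table 1.

```python
from collections import defaultdict


def build_agds(patterns, attributes):
    """Build a simplified AGDS-like index.

    patterns:   dict mapping pattern id -> tuple of attribute values
    attributes: attribute names, in the same order as the tuple entries
    """
    # Object nodes: one node per unique combination of values (duplicates aggregated).
    object_nodes = defaultdict(list)
    for pattern_id, values in patterns.items():
        object_nodes[tuple(values)].append(pattern_id)

    # Value nodes: per attribute, each unique value links to the object nodes it defines.
    value_nodes = {}
    for position, name in enumerate(attributes):
        links = defaultdict(list)
        for obj in object_nodes:
            links[obj[position]].append(obj)
        value_nodes[name] = dict(sorted(links.items()))  # values kept in sorted order
    return dict(object_nodes), value_nodes


def value_count(object_nodes, value_nodes, attribute, value):
    """Number of original patterns aggregated under one value node."""
    return sum(len(object_nodes[obj]) for obj in value_nodes[attribute].get(value, []))


# Patterns 1, 2, 13, 18 and 26 from Table 1 (leaf length, leaf width).
patterns = {1: (5.1, 3.5), 2: (4.9, 3.0), 13: (4.8, 3.0), 18: (5.1, 3.5), 26: (5.0, 3.0)}
object_nodes, value_nodes = build_agds(patterns, ["leaf length", "leaf width"])

print(list(value_nodes["leaf length"]))  # sorted unique values
print(object_nodes[(5.1, 3.5)])          # duplicates 1 and 18 share one object node
print(value_count(object_nodes, value_nodes, "leaf width", 3.0))
```

Because the values of each attribute are stored once and in order, a nearest-neighbor search can start at the value closest to the query and widen outwards instead of scanning every row.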

The above-described features make AGDS structures a universal model for storing data 
and relations that can be adapted to many computational tasks, optimizing access 
to the stored data and decreasing the computational complexity of various operations. In 
computer science, we usually talk about data structures, focusing on storing data and 
optimizing the access to them. The main disadvantage of this approach is that almost all 
data relations must be searched for in single or nested loops. The AGDS structures remove 
this disadvantage, making data and their relations available faster and often decreasing 
the computational complexity of many operations in a way that is not achievable in 
classic tabular structures. 


Table 1. Sample data consisting of two attributes of the Iris training patterns of two classes from 
UCI ML Repository [19] used for the presentation of the introduced algorithm approach. 


No | leaf length | leaf width | class  | No | leaf length | leaf width | class  | No | leaf length | leaf width | class      | No  | leaf length | leaf width | class 
1  | 5.1 | 3.5 | setosa | 26 | 5   | 3   | setosa | 51 | 7   | 3.2 | versicolor | 76  | 6.6 | 3   | versicolor 
2  | 4.9 | 3   | setosa | 27 | 5   | 3.4 | setosa | 52 | 6.4 | 3.2 | versicolor | 77  | 6.8 | 2.8 | versicolor 
3  | 4.7 | 3.2 | setosa | 28 | 5.2 | 3.5 | setosa | 53 | 6.9 | 3.1 | versicolor | 78  | 6.7 | 3   | versicolor 
4  | 4.6 | 3.1 | setosa | 29 | 5.2 | 3.4 | setosa | 54 | 5.5 | 2.3 | versicolor | 79  | 6   | 2.9 | versicolor 
5  | 5   | 3.6 | setosa | 30 | 4.7 | 3.2 | setosa | 55 | 6.5 | 2.8 | versicolor | 80  | 5.7 | 2.6 | versicolor 
6  | 5.4 | 3.9 | setosa | 31 | 4.8 | 3.1 | setosa | 56 | 5.7 | 2.8 | versicolor | 81  | 5.5 | 2.4 | versicolor 
7  | 4.6 | 3.4 | setosa | 32 | 5.4 | 3.4 | setosa | 57 | 6.3 | 3.3 | versicolor | 82  | 5.5 | 2.4 | versicolor 
8  | 5   | 3.4 | setosa | 33 | 5.2 | 4.1 | setosa | 58 | 4.9 | 2.4 | versicolor | 83  | 5.8 | 2.7 | versicolor 
9  | 4.4 | 2.9 | setosa | 34 | 5.5 | 4.2 | setosa | 59 | 6.6 | 2.9 | versicolor | 84  | 6   | 2.7 | versicolor 
10 | 4.9 | 3.1 | setosa | 35 | 4.9 | 3.1 | setosa | 60 | 5.2 | 2.7 | versicolor | 85  | 5.4 | 3   | versicolor 
11 | 5.4 | 3.7 | setosa | 36 | 5   | 3.2 | setosa | 61 | 5   | 2   | versicolor | 86  | 6   | 3.4 | versicolor 
12 | 4.8 | 3.4 | setosa | 37 | 5.5 | 3.5 | setosa | 62 | 5.9 | 3   | versicolor | 87  | 6.7 | 3.1 | versicolor 
13 | 4.8 | 3   | setosa | 38 | 4.9 | 3.1 | setosa | 63 | 6   | 2.2 | versicolor | 88  | 6.3 | 2.3 | versicolor 
14 | 4.3 | 3   | setosa | 39 | 4.4 | 3   | setosa | 64 | 6.1 | 2.9 | versicolor | 89  | 5.6 | 3   | versicolor 
15 | 5.8 | 4   | setosa | 40 | 5.1 | 3.4 | setosa | 65 | 5.6 | 2.9 | versicolor | 90  | 5.5 | 2.5 | versicolor 
16 | 5.7 | 4.4 | setosa | 41 | 5   | 3.5 | setosa | 66 | 6.7 | 3.1 | versicolor | 91  | 5.5 | 2.6 | versicolor 
17 | 5.4 | 3.9 | setosa | 42 | 4.5 | 2.3 | setosa | 67 | 5.6 | 3   | versicolor | 92  | 6.1 | 3   | versicolor 
18 | 5.1 | 3.5 | setosa | 43 | 4.4 | 3.2 | setosa | 68 | 5.8 | 2.7 | versicolor | 93  | 5.8 | 2.6 | versicolor 
19 | 5.7 | 3.8 | setosa | 44 | 5   | 3.5 | setosa | 69 | 6.2 | 2.2 | versicolor | 94  | 5   | 2.3 | versicolor 
20 | 5.1 | 3.8 | setosa | 45 | 5.1 | 3.8 | setosa | 70 | 5.6 | 2.5 | versicolor | 95  | 5.6 | 2.7 | versicolor 
21 | 5.4 | 3.4 | setosa | 46 | 4.8 | 3   | setosa | 71 | 5.9 | 3.2 | versicolor | 96  | 5.7 | 3   | versicolor 
22 | 5.1 | 3.7 | setosa | 47 | 5.1 | 3.8 | setosa | 72 | 6.1 | 2.8 | versicolor | 97  | 5.7 | 2.9 | versicolor 
23 | 4.6 | 3.6 | setosa | 48 | 4.6 | 3.2 | setosa | 73 | 6.3 | 2.5 | versicolor | 98  | 6.2 | 2.9 | versicolor 
24 | 5.1 | 3.3 | setosa | 49 | 5.3 | 3.7 | setosa | 74 | 6.1 | 2.8 | versicolor | 99  | 5.1 | 2.5 | versicolor 
25 | 4.8 | 3.4 | setosa | 50 | 5   | 3.3 | setosa | 75 | 6.4 | 2.9 | versicolor | 100 | 5.7 | 2.8 | versicolor 


The sample data in Table 1 are typical of the data in the UCI ML Repository [19] in that
they include many duplicates. Training patterns are numbered, aggregated (when
defined by the same combination of attribute values), and presented as nodes in Fig. 1
at the intersections of the attribute values which define them. The data are sorted, and
the duplicates are aggregated and represented by single representatives in the AGDS
structure, where training patterns are represented by object nodes connected to value
nodes. On this basis, it is possible to move along the axes to the nearest values and objects
until k nearest neighbors are found. Fig. 1 presents an AGDS structure constructed for
the sample data presented in Table 1. In this structure, horizontal nodes below


[Figure 1: AGDS structure for the sample data, with axes LEAF LENGTH and LEAF WIDTH]


Fig. 1. Object proximity in the 2D space representing two attributes of the Iris data. The 
automatically widened sample areas (the red dotted lines) for a given classified object depicted by 
a question mark “?” are used by the AGDS structure and the introduced algorithm for searching 
for the nearest neighbors of this classified object. The IDs of the training patterns are displayed in 
the yellow and green vertices. (Color figure online) 
represent leaf width attribute values and vertical nodes on the left represent leaf length
attribute values; all identical values from the training data (Table 1) have
already been aggregated and are represented by the same value nodes. Training patterns are
represented by green and yellow object nodes, where the colors indicate the class
labels, which are themselves connected to the object nodes that define these
classes. In the AGDS structure, class labels are treated in the same way as
the values of other attributes. The significance of this approach is described in [7]
and [10].


3 K Nearest Neighbor Classifiers 


K Nearest Neighbor (kNN) algorithms are well known among machine learning methods.
The idea of these algorithms is to determine which training patterns (neighbors)
from a training set are the closest to the classified object (test pattern). The main
challenge is to determine the distances from the classified object to the
training patterns in order to find the closest ones (called nearest neighbors) for the
defined distance function. The most popular distance function is the
Euclidean distance (1), but the Manhattan distance (2), which is faster to compute,
is also often used:


$$d_e(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2} \qquad (1)$$

$$d_m(x, y) = \sum_{i} |x_i - y_i| \qquad (2)$$


Although k Nearest Neighbor algorithms are very simple and usually give
good results, they can be very slow for large training datasets in comparison to
other classifiers, because kNN is a lazy method which does not create a computational
model of the training dataset. To find the k nearest neighbors, the whole training
dataset must be scanned, which takes linear time. Hence, the time needed to
classify a single test sample is proportional to the number of training patterns
stored in the training dataset [15, 17]. This paper introduces an associative model for
storing data together with some important relations, which allows the computational
complexity of the search for k nearest neighbors to be reduced to logarithmic or even
constant. The final computational complexity depends on how the unique attribute
values are stored (sorted tables, sorted lists, hash tables, or AVB+trees [3, 7]), on
whether new training patterns can be added to the AGDS structure, and on how many
duplicated values occur in the raw training data. The way attribute values are organized
and stored determines which search algorithm can be used to get fast access to a given
value: a binary search algorithm, an approximation search algorithm, hash functions [3],
or AVB+trees [7]. In this paper, lists and a modified binary search algorithm were used
to represent and search attribute values.
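
As a point of reference for the linear-time behavior described above, the following is a minimal sketch of a classic (lazy) kNN classifier using the distances (1) and (2). The function names and the data layout (a list of `(attributes, label)` pairs) are assumptions made for illustration; they do not come from the paper.

```python
import heapq
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance, as in Eq. (1)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance, as in Eq. (2)."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


def knn_classify(training, x, k, dist=euclidean):
    """Classify x by majority vote among its k nearest training patterns.

    `training` is a list of (attributes, label) pairs (hypothetical layout).
    Every training pattern is visited for each query, which is the cost
    that the AGDS-based approach aims to avoid.
    """
    neighbors = heapq.nsmallest(k, training, key=lambda item: dist(item[0], x))
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

For example, `knn_classify(training, (5.0, 3.4), 3)` computes the distance from the query to every stored pattern before voting.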


4 Acceleration Associative Algorithm for KNN Classifiers 


Classifiers are usually built for datasets which consist of training patterns collected in
the past and stored in tables. In this paper, we use AGDS structures instead of tables
to store training patterns as well as those of their relations that are important from the
kNN point of view. To accelerate kNN classifiers, we need fast access to the
nearest neighbors. AGDS structures automatically aggregate all attribute value duplicates
and sort the nodes representing these aggregated values for all attributes
simultaneously. Hence, we have fast access to all nodes representing objects (training
patterns) which are defined by the same or close values. The nearest neighbors are
always represented by training patterns whose attribute values are the same as, or close
to, the values defining the classified sample. Therefore, we need to create an
AGDS structure for a given training dataset and use the features of the AGDS structure
to move only to the nodes representing the training patterns which are the closest
with respect to all attributes, and to compute distances from the classified input
sample only for these patterns.

The introduced Acceleration Associative Algorithm (AAA) operating on AGDS
structures describes how to move quickly to the nodes representing the training
patterns closest (most similar) to the given combination of input values (the classified
object), e.g. "?" in Fig. 1.

Assume that we have $N$ training patterns $P_1, \dots, P_N$ which are defined by $J$
attributes, i.e. $P_n = \{p_n^1, \dots, p_n^J\}$, where each attribute value $p_n^j$ is a real number.
During the construction process of the AGDS structure for the training patterns, all
duplicated values of each attribute $j$ are aggregated separately and represented by value
nodes $V_1^j, \dots, V_{M_j}^j$ that represent $M_j$ unique attribute values $v_1^j, \dots, v_{M_j}^j$. Moreover,
each value node $V_m^j$ contains the counter $c_m^j$ that represents the number of aggregated
duplicates of the values $p_1^j, \dots, p_N^j$ that are equal to the value $v_m^j$ represented by this
node (Fig. 2). Training patterns are represented by the object nodes $O_1, \dots, O_R$. Each
object node $O_r$ represents and counts all duplicates of a training pattern, where
duplicates are training patterns defined by the same attribute
values. If there are no duplicated training patterns in the training dataset, then the
number of object nodes is equal to the number of training patterns, $R = N$; otherwise
$R < N$.
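
The aggregation step described above can be illustrated with a short sketch. It is a deliberate simplification, not the authors' AGDS implementation: it only builds the sorted value nodes with their counters and the aggregated object nodes, and it omits the connections between nodes. The name `build_agds` and the tuple-based layout are assumptions.

```python
from collections import Counter


def build_agds(patterns):
    """Aggregate duplicates into value nodes (per attribute) and object nodes.

    Returns (value_nodes, object_nodes) where value_nodes[j] is a sorted list
    of (unique value, counter) pairs for attribute j, and object_nodes maps
    each distinct pattern to the number of its duplicates.
    """
    num_attributes = len(patterns[0])
    value_nodes = []
    for j in range(num_attributes):
        counts = Counter(p[j] for p in patterns)   # counter c for each unique value v
        value_nodes.append(sorted(counts.items()))
    object_nodes = Counter(tuple(p) for p in patterns)
    return value_nodes, object_nodes
```

With duplicated patterns in the input, `len(object_nodes)` plays the role of $R$ and is smaller than the number of patterns $N$.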
Acceleration Associative Algorithm for kNN classifiers using AGDS:

Input: T: training data, x: classified sample, k: number of nearest neighbors

Output: winClass: classified label of x

BinSearchEqualOrLess (array, val)
    // Returns the index of the node whose value equals val if such a node exists;
    // otherwise returns the index of the closest lower value if there is one in
    // the array; otherwise returns null.
    size = len(array); start = 0; end = size - 1; result = null
    while (start <= end)
        middle = (start + end) / 2
        if (array[middle] <= val)
            start = middle + 1
            result = middle
        else end = middle - 1
    return result
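
A minimal Python sketch of the same "equal or closest lower" lookup is given below. It relies on the standard `bisect` module instead of a hand-written loop; the function name is an assumption, and `None` stands in for the null value of the pseudocode.

```python
import bisect


def index_equal_or_less(sorted_values, val):
    """Index of the largest element <= val in an ascending list, or None."""
    pos = bisect.bisect_right(sorted_values, val)
    return pos - 1 if pos > 0 else None
```

Because the unique attribute values are kept sorted, this lookup takes logarithmic time in the number of unique values rather than in the number of training patterns.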


FindNextClosest (val)
    if (valueNodeLessClosest == null) and (valueNodeGreaterClosest == null) then
        valueNodeLessClosest = BinSearchEqualOrLess (val)
        if (valueNodeLessClosest == null) then
            valueNodeGreaterClosest = First
            return valueNodeGreaterClosest
        else if (valueNodeLessClosest.Val == val) then
            valueNodeGreaterClosest = valueNodeLessClosest
            return valueNodeGreaterClosest
        else if (valueNodeLessClosest.IsNotMax) then
            valueNodeGreaterClosest = valueNodeLessClosest.Next
            if (val - valueNodeLessClosest.Val < valueNodeGreaterClosest.Val - val)
                then return valueNodeLessClosest
                else return valueNodeGreaterClosest
        else
            valueNodeGreaterClosest = null
            return valueNodeLessClosest
    else if (val - valueNodeLessClosest.Val < valueNodeGreaterClosest.Val - val) then
        if (valueNodeLessClosest.IsNotMin)
            then valueNodeLessClosest = valueNodeLessClosest.Prev
        else if (valueNodeGreaterClosest.IsNotMax)
            then valueNodeGreaterClosest = valueNodeGreaterClosest.Next
        else return null
    else
        if (valueNodeGreaterClosest.IsNotMax)
            then valueNodeGreaterClosest = valueNodeGreaterClosest.Next
        else if (valueNodeLessClosest.IsNotMin)
            then valueNodeLessClosest = valueNodeLessClosest.Prev
        else return null
    if (val - valueNodeLessClosest.Val < valueNodeGreaterClosest.Val - val)
        then return valueNodeLessClosest
        else return valueNodeGreaterClosest
FindkNN (k, x)
    valueNodeLessClosest = null
    valueNodeGreaterClosest = null
    create empty rankList of k object pointers and distances
    // Create the rankList of k nearest objects to the classified sample x
    do
        valueNode = AGDS.Attributes[0].FindNextClosest (x[0])
        foreach objectNode connected to valueNode
            d = calculateDistance (objectNode, x)
            if d < rankList.LastDistance then
                rankList.InsertInAscendingOrder (objectNode)
                if (rankList.Count > k) then
                    while (rankList.LastDistance > rankList[k-1].Distance)
                        rankList.RemoveLast
    while (x[0] - valueNodeLessClosest.Val <= rankList.LastDistance) and
          (valueNodeGreaterClosest.Val - x[0] <= rankList.LastDistance)
    // Determine the winning class for the classified sample
    foreach objectNode in rankList
        countLabels[objectNode.ClassLabel] += 1
        if (countLabels[objectNode.ClassLabel] > countMax) then
            countMax = countLabels[objectNode.ClassLabel]
            winClass = objectNode.ClassLabel
        else if (countLabels[objectNode.ClassLabel] == countMax) and
                (objectNode.ClassLabel != winClass) then winClass = null
    return winClass
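
The search idea behind FindkNN can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the AGDS structure is replaced by a sorted list of the unique values of the first attribute plus a dictionary from each value to the patterns that have it, and all names (`build_index`, `knn_sorted_axis`) are hypothetical. The scan moves outward from the query value and stops once the closest unvisited value is farther away, along that single attribute, than the current k-th best distance.

```python
import bisect
import heapq
import math
from collections import defaultdict


def build_index(patterns):
    """Group pattern indices by first-attribute value; sort the unique values."""
    by_value = defaultdict(list)
    for idx, p in enumerate(patterns):
        by_value[p[0]].append(idx)
    return sorted(by_value), by_value


def knn_sorted_axis(patterns, values, by_value, x, k):
    """Return the k nearest (distance, index) pairs, scanning outward on attribute 0."""
    lo = bisect.bisect_right(values, x[0]) - 1   # closest value <= x[0], or -1
    hi = lo + 1                                  # closest value  > x[0]
    best = []                                    # max-heap via negated distances
    while lo >= 0 or hi < len(values):
        gap_lo = x[0] - values[lo] if lo >= 0 else math.inf
        gap_hi = values[hi] - x[0] if hi < len(values) else math.inf
        # The one-attribute gap is a lower bound on the Euclidean distance, so
        # no unvisited pattern can improve the list once it exceeds the k-th best.
        if len(best) == k and min(gap_lo, gap_hi) > -best[0][0]:
            break
        if gap_lo <= gap_hi:
            v, lo = values[lo], lo - 1
        else:
            v, hi = values[hi], hi + 1
        for idx in by_value[v]:
            d = math.dist(patterns[idx], x)
            if len(best) < k:
                heapq.heappush(best, (-d, idx))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, idx))
    return sorted((-nd, idx) for nd, idx in best)
```

The distances returned agree with those of an exhaustive scan, while only the patterns attached to nearby values are visited when the data are dense around the query.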


The AAA algorithm described above is run for each classified object to quickly
determine the k nearest neighbors that the chosen kNN method needs in order to classify
this object. The algorithm can be combined with any variation of the kNN method.
Its role is to supply the selected kNN method with the k nearest neighbors faster than
by looking through all training patterns. This new combination of AGDS structures with
kNN classifiers is called a kNN+AGDS classifier.


5 Comparison of Results and Efficiencies 


The results of the implemented AAA algorithm operating on AGDS structures together
with the kNN classifiers on various training data are shown in Table 2 and Fig. 2. The
classification times shown in Table 2 are means over one hundred single classifications.
The presented approach can successfully accelerate k Nearest Neighbor
classifiers and work as an eager data-relation model for them.

In this paper, all datasets used in the described experiments came from the UCI
Machine Learning Repository [19], and the results obtained for these datasets are
presented in Table 2. For the experiments, we used datasets with various numbers of
instances, various numbers of attributes, and various numbers of duplicates to objectively
show differences between the average classification time of an input instance
Table 2. Comparison of classification time using kNN and kNN+AGDS.


Dataset     | Number of | Number of  | kNN classification | kNN+AGDS classification | kNN+AGDS construction
            | instances | attributes | time [ms]          | time [ms]               | time [ms]

Iris        | 150       | 4          | 0.10               | 0.08                    | 1
Banknote    | 1372      | 4          | 0.29               | 0.09                    | 5
HTRU2       | 17898     | 8          | 3.14               | 0.09                    | 134
Shuttle     | 43500     | 9          | 7.67               | 1.06                    | 278
Credit Card | 30000     | 23         | 8.69               | 1.07                    | 499
Skin        | 245057    | 3          | 26.87              | 1.10                    | 683
Drive       | 58509     | 48         | 46.15              | 1.24                    | 2224
HEPMASS     | 1048576   | 28         | 362.32             | 1.41                    | 31214

[Figure 2: classification time on a logarithmic scale for kNN and kNN+AGDS, plotted against number of instances * number of attributes]

Fig. 2. Classification time as a function of the number of instances multiplied by the number
of attributes, i.e. the number of data stored in the training data tables.


processed by kNN and kNN+AGDS classifiers. The presented combination of AGDS
structures with kNNs becomes an eager solution instead of a lazy one, as with
classic kNNs, because this new classifier creates a model using an AGDS structure.
Thus, the use of AGDS structures removes the main inconvenience of the classic kNN
classifiers.
Fig. 3. Memory usage ratio of using AGDS structures to arrays for various training data.

The results shown in Fig. 3 confirm the compression ability of AGDS for some
training datasets. For most of the datasets used in this work, AGDS leads to a
significantly shorter classification time and lower memory usage. Compression was
not achieved for three of the eight datasets in Fig. 3 due to the small number of
duplicated values in these datasets and the memory used for the representation of the
connections. On the other hand, the small extra memory usage is compensated by the
much higher speed of classification. For sorting purposes, the quicksort algorithm was
used.


6 Conclusions and Final Remarks 


This paper did not focus on improving the classification results of kNN classifiers but on
the efficiency of their use, by implementing a new Acceleration Associative Algorithm
operating on Associative Graph Data Structures. For a given number k of searched nearest
neighbors (with k much smaller than the number of all training patterns), the computational
complexity of the presented algorithm is independent of the number of training patterns,
so it works in constant time, using associations between the attribute values defining
training patterns. However, the number of computations
depends on the density of training patterns in the part of hyperspace close to the classified
sample. Hence, the efficiency of the presented algorithm grows with the amount of
training data, which makes it convenient for classification on Big Data collections.
It was shown that AGDS structures can be used as a data-relation model for
kNN classifiers, and that thanks to this model we can accelerate classification for
various k Nearest Neighbor algorithms. The efficiency of the presented algorithm and
AGDS structures is significant for big training datasets, especially when they contain
many duplicated values that define training patterns. The construction of an AGDS
structure for a training dataset can be treated as an adaptation process that develops a
computational model for kNN classifiers, because AGDS structures contain all training
data enhanced by extra useful relations quickly available to kNN classifiers. In
future studies, we will consider further improvements of the introduced approach
concerning the use of AVB+trees [7] to accelerate even further the access to attribute
values when searching for the values closest to those of the classified samples. This work was
supported by AGH 11.11.120.612.
Abstract. We are living in an era that we can call the machine learning
revolution. Having started as a purely academic and research-oriented domain,
machine learning has seen widespread commercial adoption across diverse domains,
such as retail, healthcare, finance, and many more. However, the usage of
machine learning poses its own set of challenges when it comes to explaining
what is going on under the hood. Model interpretability
is very important because the business needs to explain each and every decision
taken by the model. In order to take a step forward in this direction,
we propose a principled algorithm inspired by both preference learning
and game theory for classification. In particular, the learning problem
is posed as a two-player zero-sum game, for which we show theoretical
guarantees of convergence. Interestingly, feature selection
can be straightforwardly plugged into such an algorithm. As a consequence,
the hypothesis space consists of a set of preference prototypes along
with (possibly non-linear) features, making the resulting models easy to
interpret.


Keywords: Game theory · Margin maximization · Classification ·
Preference learning


1 Introduction 


Machine learning and intelligent systems in general are becoming increasingly
ubiquitous. However, after a first enthusiastic reaction to these seemingly
unbounded technologies, some concerns are now being raised regarding the
black-box nature of these methods. There are many examples of applications in
which the explanation plays a key role, such as recommender systems, bioinformatics
applications and support systems for physicians. The need for explanations
is also the theme of Article 22.1 of the General Data Protection Regulation,
which states that: "The data subject shall have the right not to be subject to a
decision based solely on automated processing". Unfortunately, most of the
state-of-the-art machine learning approaches are based on highly non-linear optimization
problems which are not well suited to interpretation. A glaring example
is deep neural networks (DNNs). Although in many applications DNNs are one
of the most successful approaches, they represent a really hard challenge when
dealing with model interpretation. Similar considerations also apply to
theoretically grounded methods such as Support Vector Machines.

In this work we present a principled algorithm inspired by game theory and
preference learning for classification. Specifically, the learning problem is seen as a
zero-sum game between nature and a learner. Interestingly, feature selection
can be easily plugged into the same algorithm in a natural way. The hypothesis
space consists of a set of preference prototypes together with possibly non-linear
features, making the interpretation and visualization of the resulting models very
easy.


2 Background 


2.1 Preference Learning 


Broadly speaking, a preference learning (PL) task consists of some set of items
for which (some) preferences are known, and the task is to learn a function able
to predict preferences for previously unseen items, or other preferences for the
same set of items [5]. In the context of PL, three different ranking tasks can be
defined, namely label ranking, instance ranking, and object ranking. In this work
we focus on label ranking, which can be defined as follows: given a set of input
instances $x_i \in \mathcal{X}$, $i \in [1, \dots, n]$, and a finite set of labels $\mathcal{Y} = \{y_1, y_2, \dots, y_m\}$,
the goal of a ranker is to learn a scoring function $f_\theta : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ which assigns
to each label $y_j$ a score for a pattern $x$. Hence, a label ranking task represents
a generalization of a classification task, since $f_\theta$ implicitly defines a full ranking
over $\mathcal{Y}$ for an instance $x$. In PL, the training set usually consists of a set of
pairwise preferences of the form $y_i \succ_x y_j$, which means that, for the instance
$x$, $y_i$ is preferred to $y_j$. In particular, in the case of classification, in which each
pattern $x$ is associated with a single label $y_i$, the following set of preferences is
implicitly defined: $\{y_i \succ_x y_j \mid 1 \le j \ne i \le m\}$.

Formally, $f_\theta$ has the following form [1]: $f_\theta(x, y) = w^\top \psi(x, y)$, where $\psi : \mathcal{X} \times
\mathcal{Y} \to \mathbb{R}^{d \cdot m}$ is a joint representation of item-label pairs, $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{1, \dots, m\}$,
and $w$ is a weight vector. Since $f_\theta$ has to properly rank the labels for each item,
given a preference $y_i \succ_x y_j$ then $f_\theta(x, y_i) > f_\theta(x, y_j)$ should hold, that is,


$$w^\top \psi(x, y_i) > w^\top \psi(x, y_j) \;\Rightarrow\; w^\top (\psi(x, y_i) - \psi(x, y_j)) > 0,$$


which can be interpreted as the margin (or confidence) of the preference. The higher
the confidence, the higher the generalization capability of the obtained ranker. Thus,
given a preference $y_i \succ_x y_j$, we construct its corresponding representation as
$z = \psi(x, y_i) - \psi(x, y_j)$, with $z \in \mathbb{R}^{d \cdot m}$.

We assume that the item-label joint representation is defined as


$$\psi(x, y) = x \otimes e_y^m = (\underbrace{0}_{1}, \underbrace{0}_{2}, \dots, \underbrace{x_1}_{y}, 0, \dots, \underbrace{x_2}_{y+m}, \dots, \underbrace{x_i}_{y+(i-1)m}, \dots, \underbrace{x_d}_{y+(d-1)m}, \dots, 0),$$


where $e_y^m$ is the $y$-th vector from the canonical basis of $\mathbb{R}^m$. We indicate the
$f$-th $m$-dimensional chunk of a preference $z$ with
$z[f] = (z_{(f-1)m+1}, \dots, z_{fm}) \in \mathbb{R}^m$.
Similarly, we define $z[y] = (z_y, z_{y+m}, \dots, z_{y+(d-1)m}) \in \mathbb{R}^d$.

At classification time, given a new example $x$, the predicted class $\hat{y}$ is computed
by selecting the label which maximizes the value of the scoring function
$f_\theta$, that is, $\hat{y} = \arg\max_{y \in \mathcal{Y}} f_\theta(x, y)$.
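
The joint representation and the resulting preference vectors can be made concrete with a small sketch. It is an illustration only: labels are 0-based here (the text uses 1-based labels), vectors are plain Python lists, and the function names are assumptions rather than anything defined in the paper.

```python
def psi(x, y, m):
    """Joint representation of x with label y (0-based), i.e. x (x) e_y in R^(d*m)."""
    out = [0.0] * (len(x) * m)
    for i, x_i in enumerate(x):
        out[i * m + y] = x_i       # feature i lands in chunk i, at offset y
    return out


def preference(x, y_pos, y_neg, m):
    """Vector z = psi(x, y_pos) - psi(x, y_neg) for the preference y_pos > y_neg."""
    return [a - b for a, b in zip(psi(x, y_pos, m), psi(x, y_neg, m))]


def predict(w, x, m):
    """Label maximizing the score w . psi(x, y)."""
    def score(y):
        return sum(w_k * v_k for w_k, v_k in zip(w, psi(x, y, m)))
    return max(range(m), key=score)
```

For a pattern with $d = 2$ features and $m = 2$ labels, `preference([1.0, 2.0], 0, 1, 2)` gives `[1.0, -1.0, 2.0, -2.0]`, and using that vector as `w` makes `predict` return label 0 for the same pattern.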


2.2 Game Theory 


Game theory studies the problem of making strategic decisions in competitive
environments. In this paper, we focus on two-player zero-sum games. The strategic
form of a two-player zero-sum game is defined by a matrix $M$ (the game
matrix). The two players, the row player P and the column player Q, play the
game simultaneously. In particular, the row player selects a row and the column
player selects a column of $M \in \mathbb{R}^{P \times Q}$, where $P$ and $Q$ are the numbers of
available strategies for P and Q respectively. Each entry $M_{i,j}$ represents the loss
of P, or equivalently the payoff of Q, when the strategies $i$ and $j$ are played by
the two players. The player P aims at finding a strategy minimizing its expected
loss (the value of the game) $V$, while the player Q aims at finding a strategy
maximizing $V$, its payoff. The strategies of the players are typically randomized,
meaning that the player P selects a row according to a distribution $p$ over the
rows, and the player Q selects a column according to a distribution $q$ over the
columns. These distributions are referred to as the mixed strategies of players
P and Q, respectively. The vectors $p$ and $q$ can be thought of as stochastic vectors,
that is $p \in \mathcal{S}_P$ and $q \in \mathcal{S}_Q$, where $\mathcal{S}_P = \{p \in \mathbb{R}_+^P \mid \|p\|_1 = 1\}$ and
$\mathcal{S}_Q = \{q \in \mathbb{R}_+^Q \mid \|q\|_1 = 1\}$.

It is well known [11] that for any game matrix $M$ there exists a saddle-point,
that is a pair of optimal strategies $p^*$ and $q^*$ for the two players such that


$$V = p^{*\top} M q^* = \min_p \max_q p^\top M q = \max_q \min_p p^\top M q$$


A saddle-point of this type can be computed by solving an appropriately defined
linear program with a number of variables and constraints growing linearly with
the number of (pure) strategies of the two players. This computation
becomes prohibitive for game matrices of high dimensionality. An alternative
method called adaptive multiplicative weights (AMW) [3, 4] has been proposed to
compute approximately optimal saddle-point values and strategies. The fictitious
play (FP) strategy, also called the Brown-Robinson learning process, introduced by
Brown in the 1950s [2], is a simple algorithm to efficiently compute an approximation
of the solution of a game. The FP method starts with an arbitrary initial
pure strategy for P. Then, each player in turn chooses his next pure strategy as
a best response, assuming that the other player picks uniformly at random among
his previous choices. In other words, at each step each player tries to infer
the mixed strategy of the opponent from its previous choices. The pseudo-code
of FP is reported in Algorithm 1.
Algorithm 1. FP: The Fictitious Play algorithm
Input:
    $M \in \mathbb{R}^{P \times Q}$: matrix game
    $T_e$: number of iterations
Output:
    $p$, $q$: row/column player strategy
    $V$: the value of the game

$r \leftarrow \mathrm{randint}[1, P]$
$s_p, v_p \leftarrow 0, 0$
$s_q, v_q \leftarrow M_{r,:}, e_r$
for $t \leftarrow 1$ to $T_e$ do
    $\hat{q} \leftarrow \arg\max s_q$,  $s_p \leftarrow s_p + M_{:,\hat{q}}$,  $v_q \leftarrow v_q + e_{\hat{q}}$
    $\hat{p} \leftarrow \arg\min s_p$,  $s_q \leftarrow s_q + M_{\hat{p},:}$,  $v_p \leftarrow v_p + e_{\hat{p}}$
end
$p = v_p / \|v_p\|_1$,  $q = v_q / \|v_q\|_1$
$V \leftarrow p^\top M q$
return $p, q, V$

3 A Game Theoretic Perspective of Preference Learning 


In this section we describe a new learning approach for label ranking based on
game theory. Specifically, we assume to have a set of training preferences of the
form $p_i \equiv (y_+ \succ_x y_-)$ which can be converted into their vectorial representation
$z_i$ as described in Sect. 2.1. We consider a hypothesis space of linear functions,
that is, $\mathcal{F} = \{f_w(z) : z \mapsto w^\top z \mid w, z \in \mathbb{R}^{d \cdot m}, \|w\|_2 = 1\}$. Given any preference
vector $z$, for the preference to be satisfied, $w^\top z > 0$ should hold. The
margin of the preference, $\rho(z) = w^\top z$, represents the confidence of the current
hypothesis over the preference $z$. According to the maximum margin principle,
our aim is to select the hypothesis ($w$) that maximizes the minimum margin
over the preferences of the training set.

From the Representer Theorem (see e.g. [7, 9]) we know that $w$ can be defined
as a convex combination of a subset of the training preferences, that is $w \propto
\sum_j \alpha_j z_j$, $\alpha \in \mathcal{S}_P$. Thus, the margin of a preference can be expressed as


$$\rho(z) = \sum_j \alpha_j z_j^\top z = \sum_j \alpha_j \sum_f z_j[f]^\top z[f] \;\rightarrow\; \sum_{(j,f)} q_{(j,f)}\, z_j[f]^\top z[f],$$


where the dot product $z_j^\top z$ is generalized by giving different weights to the
features according to a distribution over the features, and $q$, which combines
$\alpha_j$ with the weight given to feature $f$, is a new distribution over all the
possible preference-feature pairs $(j, f)$.

Now, let $p$ be a distribution over the set of training preferences. The expected
preference margin when the preferences are drawn according to $p$ is:


$$\rho(p, q) = \sum_i p_i \sum_{(j,f)} q_{(j,f)}\, z_i[f]^\top z_j[f] = p^\top M q \qquad (1)$$


where $M_{i,(j,f)} = z_i[f]^\top z_j[f]$. This formulation highlights the relation between a
preference learning problem and game theory. Consider a two-player zero-sum
game in which the row player P (the nature) chooses a distribution over the
whole set of training preferences as its mixed strategy, aiming at minimizing the
expected margin. Simultaneously, the column player Q (the learner) chooses a
distribution over the set of hypotheses, or preference-feature pairs, as its mixed
strategy, aiming at maximizing the expected margin (its payoff). Then the value
of the game, that is the maximal minimum margin solution, will be:


$$V = \rho(p^*, q^*) = \min_p \max_q p^\top M q$$
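
To make the game matrix concrete, the sketch below builds $M$ from a list of preference vectors and a list of preference-feature pairs, and evaluates the expected margin $p^\top M q$. It is an illustration under assumptions: chunks and indices are 0-based, vectors are plain lists, and the function names are hypothetical.

```python
def chunk(z, f, m):
    """Return the f-th chunk (0-based) of length m of a preference vector z."""
    return z[f * m:(f + 1) * m]


def game_matrix(prefs, columns, m):
    """M[i][k] = z_i[f] . z_j[f], where (j, f) is the k-th preference-feature pair."""
    matrix = []
    for z_i in prefs:
        row = []
        for j, f in columns:
            a, b = chunk(z_i, f, m), chunk(prefs[j], f, m)
            row.append(sum(u * v for u, v in zip(a, b)))
        matrix.append(row)
    return matrix


def expected_margin(p, matrix, q):
    """Expected margin rho(p, q) = p^T M q."""
    return sum(p_i * sum(m_ik * q_k for m_ik, q_k in zip(row, q))
               for p_i, row in zip(p, matrix))
```

Each column of the matrix corresponds to one hypothesis of the learner, i.e. one pair made of a training preference and a feature, which is why the number of columns grows as the product of the two.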
P q 


4 Approximating the Optimal Strategies 


The game matrix $M$ has a number of rows equal to the number of training preferences
$P$, and a number of columns equal to $Q = P \cdot F$, where $F$ is the number of
features. The number of preference-feature pairs can be huge, thus solving the
game using standard off-the-shelf methods from game theory is impractical.

For this reason, in this section we propose a new method to solve the game
incrementally. The pseudo-code of the algorithm is given in Algorithm 2. Specifically,
let $M$ be a game matrix and $(p^*, q^*, V^*)$ the optimal solution of the game. At
each iteration we only consider a subset of the columns of the entire matrix, that is
$M_t = M \Pi_t$, where $\Pi_t \in \{0, 1\}^{Q \times B}$ are left-stochastic (0, 1)-matrices, i.e. matrices
whose entries belong to the set $\{0, 1\}$ and whose columns add up to one, and $B$ is
the number of columns considered (the size of the working set in Algorithm 2). Let
$(p_t^*, q_t^*, V_t^*)$ be the solution for the matrix $M_t$. The columns of $M_t$ corresponding
to null entries in $q_t^*$ are replaced by new columns drawn randomly from $M$.
We show in the following that the value of the game obtained at each iteration
increases monotonically and is upper bounded by the optimal margin.

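At iteration $t + 1$ a new left-stochastic (0, 1)-matrix $\Pi_{t+1}$ is considered,
which is $\Pi_t$ where every column corresponding to a null entry in $q_t^*$ is substituted
with a new random stochastic vector $e_r^Q$ for a random pair $r$. Thus, it can
be shown that


$$\begin{aligned}
V_t^* = p_t^{*\top} M_t q_t^* &= p_t^{*\top} M \Pi_t q_t^* \\
&\le p_{t+1}^{*\top} M \Pi_t q_t^* \\
&= p_{t+1}^{*\top} M \Pi_{t+1} q_t^* \\
&\le p_{t+1}^{*\top} M \Pi_{t+1} q_{t+1}^* = V_{t+1}^*
\end{aligned}$$

and $V_t^* \le p^{*\top} M \Pi_t q_t^* \le p^{*\top} M q^* = V^*$ for every $t$.


5 Evaluation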


In this section we describe the experiments performed to assess the effectiveness 
of our proposal. We performed two different sets of experiments. The first set 
aims to assess the proposed algorithm in terms of interpretability. The second one 
is focused on the performance comparison between our method and a standard 
SVM. 
Algorithm 2. Proposed algorithm
Input:
    $\mathcal{P}$: set of training preferences
    genF: random feature generator
    $B$: size of the working set
    $T$: number of epochs
    $T_e$: number of iterations of Fictitious Play (FP)
Output:
    $\mathcal{Q}$: working set of preference-feature pairs
    $q$: mixed strategy of preference-feature pairs in $\mathcal{Q}$

random initialization of the set $\mathcal{Q}$ such that $|\mathcal{Q}| = B$
compute the matrix game $M$ on the basis of $\mathcal{P}$ (rows) and $\mathcal{Q}$ (cols)
for $t \leftarrow 1$ to $T$ do
    $p, q, v \leftarrow \mathrm{FP}(M, T_e)$
    if $t < T$ then
        foreach $(j, f) \mid q_{(j,f)} = 0$ do
            $(j', f') \leftarrow \mathrm{pick}(\mathcal{P}), \mathrm{genF}()$
            update $\mathcal{Q}$: replace $(j, f)$ with $(j', f')$
            update columns of $M$ w.r.t. $\mathcal{Q}$:
                let $k$ be the position of $(j', f')$ in $\mathcal{Q}$,
                for all $i \in \mathcal{P}$, $M_{i,k} = z_i[f']^\top z_{j'}[f']$
        end
    end
end
return $q$, $\mathcal{Q}$


5.1 Model Interpretation 


In the first set of experiments, we employed our algorithm to select the most
relevant features in order to interpret the model. The aim is to use these features
to explain the decision. We ran the method on four benchmark datasets,
three of which (namely, tic-tac-toe, monks-1 and monks-3) have a specific
logical rule that explains the positive class. The tic-tac-toe dataset has been
converted into a binary-valued dataset through one-hot encoding. The remaining
dataset, i.e., mnist-49, is composed of the instances of the handwritten digit
dataset mnist concerning only the classes 4 and 9. In this case, the extracted
features are used as a visual aid in order to highlight the points of interest that
the model uses to discriminate a 4 from a 9, and vice versa. The details of the
datasets are reported in Table 1. Table 2, instead, shows the logical rules which
explain the positive class. It is noteworthy that the only difference between monks-1
and monks-3 is the instance labelling. Even though both these datasets
have been artificially created, their coverage w.r.t. the associated explanation
rule is not complete. These experiments have been performed using the following
procedure. We trained our model using polynomial features of different
degrees [1,...,3]. This choice depends on the fact that all the rules of interest
are expressed in terms of disjunctive normal form formulas with conjunctive
Table 1. Datasets information: name, number of instances, number of features, and
class prior. All the datasets are freely available in the UCI repository [10].


Dataset     | #Instances | #Features | Class prior
tic-tac-toe | 958        | 27        | 65/35
monks-1     | 432        | 17        | 50/50
monks-3     | 432        | 17        | 53/47
mnist-49    | 13782      | 784       | 50/50


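Table 2. Logical rules which explain the positive class of the datasets. The variable
$x_i$ indicates the $i$-th input feature of a vector in the corresponding dataset.


Dataset     | Rule

tic-tac-toe | $(x_8 \wedge x_{14} \wedge x_{20}) \vee (x_5 \wedge x_{14} \wedge x_{23}) \vee (x_2 \wedge x_{14} \wedge x_{26}) \vee$
            | $(x_8 \wedge x_{17} \wedge x_{26}) \vee (x_{11} \wedge x_{14} \wedge x_{17}) \vee (x_2 \wedge x_{11} \wedge x_{20}) \vee$
            | $(x_{20} \wedge x_{23} \wedge x_{26}) \vee (x_2 \wedge x_5 \wedge x_8)$

monks-1     | $(x_0 \wedge x_3) \vee (x_1 \wedge x_4) \vee (x_2 \wedge x_5)$

monks-3     | $(x_{13} \wedge x_8) \vee (\neg x_5 \wedge \neg x_{14})$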


clauses up to arity 3. In the case of binary-valued data, polynomial features
correspond to conjunctions, and hence they are suited to our purposes. Since
we have a-priori knowledge about the game of tic-tac-toe, for this specific dataset
we only used the polynomial of degree 3.
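
The correspondence between polynomial features and conjunctions on binary data can be checked with a short sketch. The function name and the dictionary-based representation are assumptions made for illustration; this is not the feature generator used in the experiments.

```python
from itertools import combinations_with_replacement


def polynomial_features(x, degree):
    """Map each index tuple of length `degree` to the product of those inputs."""
    feats = {}
    for idx in combinations_with_replacement(range(len(x)), degree):
        value = 1
        for i in idx:
            value *= x[i]
        feats[idx] = value
    return feats
```

For a binary vector, every monomial equals the logical AND of the variables it contains, and a monomial with a repeated variable, such as the one indexed by `(0, 0, 2)`, takes the same value as the lower-degree conjunction of `x[0]` and `x[2]`.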

Table 3 shows the 10 most relevant features for each tested dataset and each
polynomial. Features are sorted with respect to their corresponding weight in
the solution. It is evident from the table that the retrieved best features are
the ones involved in the explanation rules (Table 2). In tic-tac-toe the first
8 polynomial features correspond to the rule which describes the wins of the
crosses. The remaining two features represent a single naught in the central and
in the bottom right cell. These features are useful to discriminate a win for the
naughts, which is reasonable in particular for the central cell, which is actually
one of the most useful squares for getting three in a row. Unlike the tic-tac-toe
case, in which the polynomial was tailored to the specific set of rules, in the case
of monks we tried all polynomials up to degree 3. Nonetheless, in these
cases the algorithm managed to retrieve the right features regardless of the
polynomial used. For example, in monks-1 with $d = 3$ the first three features
contain repeated variables, which means that the features are actually of degree
2 ($x = x^2$ if $x \in \{0, 1\}$). The same considerations apply to monks-3. Moreover,
in monks-3 the algorithm has also been able to correctly identify the polarity
of the features. In fact, $x_{14}$ and $x_5$ contribute to distinguishing the negative class
from the positive one, which reflects the $\neg$ logical operator in the explanation rule.

The mnist-49 dataset has been used as a more realistic use case since there
are no simple rules that govern the classification. The goal here is to use the
most relevant features to interpret, in human terms, which are the visual

Table 3. The 10 most relevant features (R1 to R10) for each dataset and polynomial
degree. The variable $x_i$ indicates the i-th input feature of a vector in the
corresponding dataset. (-) means that the feature discriminates the negative class
from the positive one. The column d indicates the degree of the polynomial used.


Data|d|R1 R2 R3 R4 R5 R6 R7 R8 R9 |R10 
t-t-t 3 [28117826 |w2011 820 | F201 472g |T8T14T20 %290%23026|711 014017 | 72508 To 014023 83" |235" 
mks-1/1) 2 4: 213° 212° 25 z1 T3 Tg zo T10 [Ta 
mks-1|2|2713 2124 zats al, x24: xy: 213° mig 275: a2. 
mks-1|3 Loxa mima aque 210% a3: ajo" aa xe. z3. ats: 
mks-3|1/2 14° 5 213 8 23 Za T2 vi zo |12 
mks-3|2|13113 224: z2. «2 a2 x? x25 ula z? z2 
nks-3|3 25073 zia: z5: ej 23 ada ey eg zis |2326 


characteristics that are leveraged by the model to discriminate the two classes.
For this dataset too, all polynomials up to degree 3 have been considered.
The best results in terms of accuracy have been achieved with degree 2. For
this reason, and also for visualization purposes, we used the degree 2 features to
build Fig. 1. The figure shows the most relevant degree 2 polynomial features used
by the model to distinguish a 4 from a 9 (left) and vice versa (right). The features are
represented as segments between the two involved variables (i.e., pixels) in each
monomial. The average digits are depicted in the background. From the figure it is
clear that there are large differences in how the two digits are discriminated.
In the 4 vs 9 case (left), the region of interest is the left (almost) vertical line of
the 4. In particular, each pair of pixels involved in the features seems to follow
the gradient of this line. Similarly, in the 9 vs 4 case (right), the region of interest
is represented by the left curvature of the circle of the 9. In this case, the pairs
of pixels seem to follow the border of such curvature. It is also interesting to
notice that some of the relevant pixels are outside the grey region. This can be
explained as a way to deal with outliers.


Fig. 1. Visualization of the most relevant polynomial features of degree 2. The
polynomial features are visualized as segments limited by the involved input variables.
The left-hand side plot shows the features relevant to discriminate the 4 from the 9,
and vice versa in the right-hand side plot. In the plots, the opacity of the visualized
feature is exponentially related to its weight.


5.2 Feature Selection 


This set of experiments aims to assess the effectiveness of the proposed algorithm
on datasets with many noisy and redundant features. The chosen testbeds are
the datasets of the NIPS 2003 Feature selection challenge [6]. The details
of these datasets are reported in Table 4. An important observation about the
datasets is the huge number of features w.r.t. the number of training instances.
In addition to the number of features of each dataset, Table 4 reports the actual
number of real features. For more details about the datasets please refer to [6,8].
All datasets concern binary classification tasks.


Table 4. Dataset information: name, number of instances, number of features, number
of real features (the remaining ones are probes), and class prior. All the datasets are
freely available at the NIPS 2003 Feature selection challenge site,
http://clopinet.com/isabelle/Projects/NIPS2003/.


Dataset  | #Instances | #Features | #Real feat. | Class prior
arcene   | 200        | 10000     | 7000        | 44/56
dexter   | 600        | 20000     | 9947        | 50/50
dorothea | 1150       | 100000    | 50000       | 90/10
gisette  | 7000       | 5000      | 250         | 50/50
madelon  | 2600       | 500       | 20          | 50/50


We compared the proposed algorithm with a standard SVM. Given the huge 
number of features of the target datasets, the usage of higher degree polynomial 
kernels was not effective. For this reason, the reported results have been obtained 
using the linear kernel. The C parameter of the SVM has been validated in the set 
of values [107*,..., 10°} using a 5-fold cross validation procedure. The reported 
results are the average over 10 runs of the experiments over different data splits. 
Table 5 summarizes the achieved results. The size B of the working set has been 
set to 2000. 

As evident from the table, the proposed method is able to achieve comparable
and sometimes better performance than the SVM. It is worth mentioning that, since
the working set had a size of 2000, the number of features used by our algorithm
was generally orders of magnitude smaller than the number of original features.

Table 5. Accuracy results achieved by SVM and the proposed algorithm. The last
column indicates the number of support preference-feature pairs used by the proposed
algorithm.


Dataset  | SVM   | Proposal | #Relevant feat.
arcene   | 90.00 | 88.33    | 125
dexter   | 92.78 | 91.11    | 260
dorothea | 91.88 | 93.04    | 468
gisette  | 96.71 | 97.05    | 1056
madelon  | 60.10 | 60.10    | 1448


6 Conclusions and Future Work


We proposed a principled algorithm for classification inspired by preference
learning and game theory. Empirical evaluations have shown the feasibility of
performing non-linear feature selection efficiently. Moreover, we have shown how it
is possible to leverage the selected (possibly) non-linear features to interpret the
resulting model. In the future we plan to adapt the proposed method to very
large-scale datasets.


Abstract. Security incident tracking systems receive a continuous, unlimited
inflow of observations, where in the typical case the most recent ones are the most
important. These data flows are characterized by high volatility. Their characteristics
can change drastically over time in an unpredictable way, altering
their typical normal behavior. In most cases it is not possible to store all of the
historical samples, since their volume is unlimited. This fact requires the
extraction of real-time knowledge over a subset of the flow, which contains a
small but recent percentage of all observations. This raises serious concerns about
the accuracy and reliability of the employed classifiers. The research described
herein uses a Dynamic Ensemble Learning (DYENL) approach for Data Stream
Analysis (DELDaStrA), which is employed in Real-Time Threat Detection systems.
More specifically, it proposes a DYENL model that uses the “Kappa”
architecture to perform analysis of data flows. The DELDaStrA is based on the
hybrid combination of k Nearest Neighbor (KNN) Classifiers with Adaptive
Random Forest (ARF) and the Primal Estimated SubGradient Solver for Support
Vector Machines (SVM) (SPegasos). In fact, it performs a dynamic extraction of
the weighted average of the three results, to maximize the classification accuracy.


Keywords: Dynamic ensemble learning · Big data · Data streams analysis ·
“Kappa” architecture · Critical infrastructure · Real-time threat detection


1 Introduction 


The data created by SCADA [31], and more generally by Industrial Control Systems
(ICS) [20], have caused an exponential increase in the information obtained. This fact has
led to the adoption of architectures which incorporate proper algorithms for real-time
data stream processing. These algorithms are dynamically adjusted by new models or
when the data are produced as a function of time [5]. The “Kappa” architecture uses a
real-time engine and it is the most suitable approach for the analysis of data flows [25].
For each new sample, a small incremental update of the model takes place, and the model
gradually improves as more data arrive. The error in the real-time engine is calculated
at each iteration, as data characteristics can change drastically and in an unpredictable
way. This changes the typical, normal behavior, and an object that may have been
considered extreme can be included in the normal observations, due to rapid
developments in the data stream (Fig. 1).


[Figure: Kappa architecture diagram with a Streaming Layer and a Serving Layer (Serving Backend)]

Fig. 1. Kappa architecture (https://www.oreilly.com/ideas/applying-the-kappa-architecture-in-the-telco-industry)


Due to the unlimited volume of data, data mining is performed on a subset of the
flow, which is called a sliding window (SLWI). Clearly the SLWI contains a small but
recent percentage of the observations included in the global set. The goal of these data
processing algorithms is to minimize the cumulative error over all iterations, which can
be calculated by the following function (1) [2]:


$$I_n[w] = \sum_{j=1}^{n} V(\langle w, x_j \rangle, y_j) = \sum_{j=1}^{n} (x_j^T w - y_j)^2 \qquad (1)$$


where $x_j \in \mathbb{R}^d$, $w \in \mathbb{R}^d$ and $y_j \in \mathbb{R}$. Suppose that $X$ is the $i \times d$ data matrix and $Y$ is
the $i \times 1$ vector of target values, obtained after the arrival of the first $i$ data points. If we
accept that the covariance matrix $\Sigma_i = X^T X$ is invertible, the optimal solution
$f^*(x) = \langle w^*, x \rangle$ is given by the following function (2):


$$w^* = (X^T X)^{-1} X^T Y = \Sigma_i^{-1} \sum_{j=1}^{i} x_j y_j \qquad (2)$$


If we compute the covariance matrix as $\Sigma_i = \sum_{j=1}^{i} x_j x_j^T$, the time complexity
(TC) of this step is $O(id^2)$, inverting the $d \times d$ matrix costs $O(d^3)$, whereas the rest of the
multiplication requires TC equal to $O(d^2)$. Thus, the TC finally becomes equal to
$O(id^2 + d^3)$. If n is the number of points in the dataset, it is necessary to recalculate the
solution after the arrival of each new data point $i = 1, 2, \dots, n$. So, the final time
complexity is of the order $O(n^2 d^2 + n d^3)$, which would make the algorithm unsuitable
for application in demanding, fast-changing environments such as the one under
consideration [2, 24]. It is therefore important to note that in-stream processing is subject to
time constraints, as applications require explanatory results in real time, and there are
also significant memory requirements.

It is clear from the above that a secure approach to data flow mining problems
requires robust systems characterized by reliability and high accuracy rates, without a
demand for high resource availability. Good preparation and methodical determination
of their operating parameters are needed to avoid slow convergence, or
undesirable fluctuations in accuracy, which may be associated with frequent model
updates, and instability or loss of generalization, which may be due to corrupted and
noisy data.
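
The cost argument around Eq. (2) can be made concrete with a small sketch. It keeps the sums $\sum_j x_j x_j^T$ and $\sum_j x_j y_j$ up to date as points arrive, so that each new point costs $O(d^2)$ instead of the $O(id^2)$ needed to rebuild the covariance matrix from scratch. The class name `RunningLeastSquares`, the helper `solve` and the toy stream are hypothetical; this is the textbook incremental accumulation, not the method proposed in this paper, and it assumes that the accumulated matrix is invertible.

```python
def solve(a, b):
    """Solve the linear system a @ w = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[k][n] / m[k][k] for k in range(n)]


class RunningLeastSquares:
    """Keeps Sigma_i = sum_j x_j x_j^T and sum_j x_j y_j up to date."""

    def __init__(self, d):
        self.sigma = [[0.0] * d for _ in range(d)]
        self.xy = [0.0] * d

    def update(self, x, y):
        # O(d^2) per arriving point, instead of O(i d^2) for a full rebuild.
        for r, xr in enumerate(x):
            self.xy[r] += xr * y
            for c, xc in enumerate(x):
                self.sigma[r][c] += xr * xc

    def weights(self):
        # Solving the d x d system still costs O(d^3).
        return solve(self.sigma, self.xy)


# Hypothetical noise-free stream generated by y = 2*x1 - 1*x2.
stream = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
model = RunningLeastSquares(d=2)
for x, y in stream:
    model.update(x, y)
w = model.weights()
```

On this toy stream the recovered weights match the generating coefficients. The $O(d^3)$ solve at every step remains, which is one reason why streaming methods prefer small gradual updates of the model.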


1.1 Literature Review 


Soft computing techniques are capable of modeling and detecting cyber security threats
[6-14] and they also offer optimization mechanisms in order to produce reliable results. In
many applications, learning algorithms have to act in dynamic environments where data
are collected in the form of transient data streams. Krawczyk et al. [21] investigated
data stream classification as well as regression tasks. Besides presenting a comprehensive
spectrum of ensemble approaches for data streams, the authors also discussed
advanced learning concepts such as imbalanced data streams. Liu et al.
[26] presented a confidence-based weight computation policy to deal with the
problem of the sub-classifiers’ weights in dynamic data stream ensemble classification.
The policy fully considers the influence of the sample on the weight of the sub-classifier.
Krawczyk and Cano [22] introduced a dynamic and self-adapting threshold that was
able to adapt to changes in the data stream, by monitoring outputs of the ensemble to
exploit underlying diversity in order to efficiently anticipate drifts.

Nowadays, intrusion detection systems (IDS) have become one of the most
important weapons against cyber-attacks. Chand et al. [4] performed a comparative
analysis of the SVM classifier’s performance when it was stacked with other classifiers like
BayesNet, AdaBoost and Random Forest. Ahmin and Ghoualmi-Zine [1] used two
different classifiers iteratively, where each iteration represented one level in the built
model. To ensure the adaptation of their model, the authors added a new level whenever
the sum of new attacks and the rest of the training dataset reached the threshold.

Data mining in non-stationary data streams has recently been gaining more attention,
especially in the context of the Internet of Things and Big Data. Losing et al. [27] proposed
the Self Adjusting Memory (SAM) model for the k-Nearest Neighbor (k-NN) algorithm,
since k-NN constitutes a proven classifier within the streaming setting. SAM-KNN
could deal with heterogeneous concept drift, i.e. different drift types and rates,
using biologically inspired memory models and their coordination. Rani and Sumathy
[28] used the k-NN algorithm to determine the optimal subset.

There are a few studies on the Primal Estimated sub-Gradient Solver for SVM
(Pegasos) algorithm. Shalev-Shwartz et al. [29] described and analyzed a simple and
effective stochastic sub-gradient descent algorithm for solving the optimization problem
cast by SVM. Their algorithm was particularly well suited for large text classification
problems, where the authors demonstrated an order-of-magnitude speedup over
previous SVM methods. Farda [18] explored machine learning in Google Earth Engine
and its accuracy for multi-temporal land use mapping of coastal wetland areas.


1.2 Datasets


Appropriate datasets were chosen that closely simulate ICS communication and
transaction data. They were used in the development and evaluation of the proposed
model. The following network transaction data, preprocessed to strip
lower layer transmission data (e.g. TCP, MAC), were used in this research [15]:


• The water_tower_dataset includes 23 independent parameters and 236,179 instances,
  of which 172,415 are normal and 63,764 are outliers. In total, 86,315 normal
  instances were used in the training phase (water_train_dataset), whereas the
  water_test_dataset comprises 86,100 normal instances and 63,764 outliers.

• The gas_dataset includes 26 independent features and 97,019 instances, of which
  61,156 are normal and 35,863 are outliers. Training of the algorithm was done with the
  gas_train_dataset, which contains 30,499 normal instances, whereas the
  gas_test_dataset comprises 30,657 normal instances and 35,863 outliers.

• Finally, the electric_dataset includes 128 independent variables with 146,519
  instances, of which 90,856 are normal and 55,663 are outliers. The training was
  performed on the electric_train_dataset, comprising 45,402 normal instances,
  whereas the remaining 45,454 normal instances and the 55,663 outliers belong to the
  electric_test_dataset.


More details regarding the datasets and their choice can be found in [15].


2 Proposed Dynamic Weighted Average Methodology 


This research proposes an intelligent and dynamic Ensemble Machine Learning system
(EMLS) [32] aiming to develop a stable and accurate framework, which will have the
ability to generalize. The EMLS employs an innovative version of the “Kappa”
architecture that combines the ARF, SPegasos and k-NN SAM algorithms. DELDaStrA
performs real-time analysis and assessment of critical infrastructure data, in
order to classify and identify undesirable digital security situations related to
cyber-attacks. The reason for using the ensemble approach is the multivariance that usually
appears in such multifactorial problems of high complexity, due to the heterogeneity of
the data flows. This is typically the case in digital security and critical infrastructures.

The two most important advantages of Ensemble Techniques are
that they offer better prediction and more stable models, as the overall behavior of a
multiple model is less noisy than that of a corresponding single one [23]. Also, an Ensemble
method can lead to very stable prediction models, while offering generalization.
Finally, these models can reduce the bias and the variance, and they can avoid overfitting
[17], producing robust learning models.

Three classifiers were employed in the development of this model (Ensemble Size).
The number of classifiers was determined after considering the law of diminishing
returns in ensemble construction, in a trial and error approach. The applied algorithms
were chosen based on their different decision-making philosophy and methodology to
address the problem, in order to cover the number of possible cases associated with the
tactics of attacks against critical infrastructure. In general, the choice was based both on
static tests combined with the trial and error method, and on the basic properties of
these algorithms regarding the way they handle each situation.

More specifically, the following approaches were used. SVMs are non-parametric
models, chosen due to the way they handle outliers. Random Forests use subsets of
the training set with bagging, and subsets of features, which favor the reduction of the
effect of outliers or extreme values. The k-NN classifier is inherently non-linear, it
can detect linearly or non-linearly distributed data and it tends to perform very well with a
lot of data points. Also, the choice of the algorithms was based on the diversity of their
operation and parameterization (Reliability of Ensemble), which is achieved with
different architectures, hyper-parameter settings and training techniques. The
determination of the weights of the different models of the Ensemble was based exclusively
on static trial and error tests [16].

The DELDaStrA operation mode includes the parallel analysis of the data flow by
all three algorithms and the dynamic extraction of the weighted average of the three
results. More specifically, each data flow is checked by each algorithm and the
classification accuracy is obtained. Then the maximum accuracy is assigned a weight
equal to 0.6, whereas each of the other two accuracies is assigned a weight equal to 0.2,
and the weighted average is calculated. This process is presented in the pseudocode of the
following Algorithm 1.


Algorithm 1. Dynamic Weighted Average
Input: x_1, x_2, x_3 /* classifier accuracies */
Step 1: if ((x_1 > x_2) && (x_1 > x_3))
            max = x_1; else if (x_2 > x_3)
            max = x_2; else max = x_3;
Step 2: Set w_max = 0.6, w_1 = 0.2 and w_2 = 0.2
Step 3: Calculate x̄ = (w_max · max + w_1 · r_1 + w_2 · r_2) / (w_1 + w_2 + w_max),
        where r_1 and r_2 are the two accuracies other than max
Output: The dynamic weighted average of classification accuracy
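
Algorithm 1 can be sketched in a few lines of Python. The function name `dynamic_weighted_average` and its parameter names are hypothetical; the weights 0.6 and 0.2 are the ones stated in the text, and the input values are the Kappa statistics reported for the water_tower_dataset with a window size of 5000 in Table 1 of Sect. 4.

```python
def dynamic_weighted_average(accuracies, w_max=0.6, w_rest=0.2):
    """Weighted average that gives w_max to the best accuracy and w_rest to each other one."""
    best = max(range(len(accuracies)), key=lambda k: accuracies[k])
    weights = [w_max if k == best else w_rest for k in range(len(accuracies))]
    total = sum(w * a for w, a in zip(weights, accuracies))
    return total / sum(weights)


# Kappa statistics (window size 5000) reported for the water_tower_dataset:
# k-NN SAM, SPegasos, ARF.
water_kappa = [74.56, 72.07, 71.86]
ensemble_kappa = dynamic_weighted_average(water_kappa)
```

Rounded to two decimals, the result agrees with the 73.52% reported in the “Ensemble averaging” row of that table.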


The use of the weighted average significantly enhances the visualization
of the trends in the estimated state, as it eliminates or at least minimizes the statistical
noise of the data streams. This is one of the best ways to assess the strength of a trend
and the likelihood of its reversal, as it places more weight on the classification with the
highest accuracy. It provides real indications before the start of a new situation or
event, thus allowing for a quick and optimal decision.

It is also important to note that this dynamic process ensures the adaptation of the
system to new situations, by offering generalization, which is one of the key issues in
the field of machine learning. In this way we are implementing a robust framework
capable of responding to high complexity problems. Also, this architecture greatly
accelerates the process of making an optimal decision with the rapid convergence of the
multiple model, which is less noisy and much more reliable than a single learning
algorithm [23].


3 Ensemble Algorithms


3.1 Adaptive Random Forests 


It is clear that data flow management, and especially knowledge extraction procedures
with Machine Learning algorithms applied on the flows, are unlikely to be performed
with multiple iterations over the input data. Accordingly, the adaptation of the Random
Forest algorithm depends on a suitable aggregation process that is partly achieved by
bootstrapping, and partly by limiting each leaf split decision to a subset of
attributes. This is achieved by modifying the base tree algorithm, effectively
reducing the set of features examined for further splits to random subsets of size
m, where m < M (M corresponds to the total number of features examined per
case) [19].

In non-streaming bagging, each of the n base models is trained on a bootstrap
sample of size Z, created by drawing random samples with replacement from the original
training set. Each bootstrapped sample contains an original training instance K times,
where P(K = k) follows a binomial distribution. For large values of Z this binomial
distribution tends to a Poisson distribution with λ = 1. In contrast, in the ARF method for
streaming data, Poisson with λ = 6 is used instead of Poisson with λ = 1. This “feedback”
has the practical effect of increasing the probability of assigning higher weights to
instances during the training of the base models.
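
The effect of moving from Poisson(λ = 1) to Poisson(λ = 6) can be illustrated with a short simulation. The function names, the seed and the number of draws are assumptions of this sketch; the sampler is Knuth's classical multiplication method and is not taken from the ARF implementation.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def online_bagging_weights(n_models, lam, rng):
    """For one arriving instance, draw how many times each base model trains on it."""
    return [poisson(lam, rng) for _ in range(n_models)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
draws_lam1 = [poisson(1.0, rng) for _ in range(20000)]
draws_lam6 = [poisson(6.0, rng) for _ in range(20000)]
mean_lam1 = sum(draws_lam1) / len(draws_lam1)
mean_lam6 = sum(draws_lam6) / len(draws_lam6)
# Fraction of draws equal to 0, i.e. instances a base model would skip entirely.
skipped_lam1 = draws_lam1.count(0) / len(draws_lam1)
skipped_lam6 = draws_lam6.count(0) / len(draws_lam6)
```

With λ = 1 a base model skips roughly a third of the arriving instances, whereas with λ = 6 almost every instance is used and the average training weight is about six times larger.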

ARF is an adaptation of the original Random Forest algorithm, which has been 
successfully applied to a multitude of machine learning tasks. In layman’s terms the 
original Random Forest algorithm is an ensemble of decision trees, which are trained 
using bagging and where the node splits are limited to a random subset of the original 
set of features. The “Adaptive” part of ARF comes from its mechanisms to adapt to 
different kinds of concept drifts, given the same hyper-parameters. 

The overall ARF pseudo-code is presented below [19]. 


Algorithm 2. Adaptive Random Forests
function ARF(m, n, δ_w, δ_d)
    T ← CreateTrees(n)
    W ← InitWeights(n)
    B ← ∅
    while HasNext(S) do
        (x, y) ← next(S)
        for all t ∈ T do
            ŷ ← predict(t, x)
            W(t) ← P(W(t), ŷ, y)
            RFTreeTrain(m, t, x, y)
            if C(δ_w, t, x, y) then
                b ← CreateTrees()
                B(t) ← b
            end if
        end for
        for all b ∈ B do
            RFTreeTrain(m, b, x, y)
        end for
    end while
end function

Where m: the maximum number of features evaluated per split; n: the total number of trees
(n = |T|); δ_w: the warning threshold; δ_d: the drift threshold; C(·): the change detection
method; S: the data stream; B: the set of background trees; W(t): the weight of tree t;
P(·): the learning performance estimation function.


3.2 K-NN Classifier with Self Adjusting Memory


The k-NN SAM algorithm is inspired by the Short-Term and Long-Term memory
(STM & LTM) model [27]. The information arriving in the STM is accompanied by
relevant knowledge from the LTM. The information that receives enough attention is
transferred to the LTM in the form of Synaptic Consolidation. The memories are
represented by the sets $M_{ST}$, $M_{LT}$, $M_C$, which are subsets of $\mathbb{R}^n \times \{1, \dots, c\}$.
The STM is a dynamic sliding window that contains the most recent m examples of the
data flow [27]:


$$M_{ST} = \{(x_i, y_i) \in \mathbb{R}^n \times \{1, \dots, c\} \mid i = t - m + 1, \dots, t\} \qquad (3)$$


The LTM retains all of the initial information and, unlike the STM, it is not a
continuous part of the data flow. It is a set of p points:


$$M_{LT} = \{(x_i, y_i) \in \mathbb{R}^n \times \{1, \dots, c\} \mid i = 1, \dots, p\} \qquad (4)$$

The combined memory $M_C$ is the union of both memories, with size m + p:

$$M_C = M_{ST} \cup M_{LT} \qquad (5)$$

Each set induces a weighted k-NN classifier:

$$\mathbb{R}^n \mapsto \{1, \dots, c\}: \quad \text{kNN}_{M_{ST}}, \ \text{kNN}_{M_{LT}}, \ \text{kNN}_{M_C} \qquad (6)$$


The k-NN approach assigns a label to each data point x based on a set
$Z = \{(x_i, y_i) \in \mathbb{R}^n \times \{1, \dots, c\} \mid i = 1, \dots, n\}$:


$$\text{kNN}_Z(x) = \arg\max_{\hat{c}} \Big\{ \sum_{x_i \in N_k(x, Z) \mid y_i = \hat{c}} \frac{1}{d(x_i, x)} \ \Big|\ \hat{c} = 1, \dots, c \Big\} \qquad (7)$$


where $d(x_i, x)$ is the Euclidean distance between two points and $N_k(x, Z)$ returns the set
of the k nearest neighbors of x in Z [27].
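
The distance-weighted vote of Eq. (7) can be sketched as follows. The function name `weighted_knn`, the small constant `eps` (added here only to avoid a division by zero when the query coincides with a stored point) and the memory contents are assumptions of this illustration; the memory management of SAM itself is not reproduced.

```python
import math
from collections import defaultdict


def weighted_knn(x, memory, k, eps=1e-12):
    """Label x by inverse-distance-weighted voting over its k nearest neighbours.

    `memory` is a list of (point, label) pairs; `eps` is an assumption of this
    sketch that avoids a division by zero when x coincides with a stored point.
    """
    by_distance = sorted(memory, key=lambda item: math.dist(item[0], x))
    votes = defaultdict(float)
    for point, label in by_distance[:k]:
        votes[label] += 1.0 / (math.dist(point, x) + eps)
    return max(votes, key=votes.get)


# Hypothetical two-class memory contents.
short_term = [((0.0, 0.0), 1), ((0.1, 0.2), 1), ((1.0, 1.0), 2)]
long_term = [((0.9, 1.1), 2), ((1.2, 0.8), 2)]
combined = short_term + long_term  # M_C as the union of both memories

label_near_origin = weighted_knn((0.05, 0.05), combined, k=3)
label_near_corner = weighted_knn((1.0, 0.9), combined, k=3)
```

The same function can be applied to the short-term, long-term or combined memory, which corresponds to the three classifiers of Eq. (6).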


3.3 Primal Estimated Sub-Gradient Solver for SVM 


SPegasos is a simple and effective stochastic sub-gradient descent algorithm for
solving the optimization problem cast by SVM [29]. Initially, $w_1$ is defined. At
iteration t of the algorithm, we use a random training example $(x_{i_t}, y_{i_t})$ by choosing an
index $i_t \in \{1, \dots, m\}$. The objective to be minimized is given by Eq. (8):


$$\min_w \frac{\lambda}{2} \|w\|^2 + \frac{1}{m} \sum_{(x, y) \in S} \ell(w; (x, y)) \qquad (8)$$


where $\ell(w; (x, y)) = \max\{0, 1 - y \langle w, x \rangle\}$. With the sample $(x_{i_t}, y_{i_t})$, this objective is
approximated by the following function:


$$f(w; i_t) = \frac{\lambda}{2} \|w\|^2 + \ell(w; (x_{i_t}, y_{i_t})) \qquad (9)$$


whose sub-gradient at $w_t$ is

$$\nabla_t = \lambda w_t - \mathbb{1}[y_{i_t} \langle w_t, x_{i_t} \rangle < 1] \, y_{i_t} x_{i_t}, \qquad (10)$$


and $\mathbb{1}[y_{i_t} \langle w_t, x_{i_t} \rangle < 1]$ is the indicator function, which takes the value 1 if the argument is
true and 0 in any other case. Then, we apply the update $w_{t+1} \leftarrow w_t - \eta_t \nabla_t$,
using the step size $\eta_t = 1/(\lambda t)$. After T iterations, the last value of the weight vector is
$w_{T+1}$ [29].
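
A minimal sketch of the update rule built on Eq. (10) is given below. The function names, the regularization value, the number of iterations, the seed and the toy data are assumptions made for illustration; starting from a zero weight vector follows the Pegasos paper [29].

```python
import random


def pegasos(samples, lam, iterations, seed=0):
    """Stochastic sub-gradient descent on the SVM objective of Eq. (8).

    `samples` is a list of (x, y) pairs with y in {-1, +1}. Starting from
    w_1 = 0 follows the Pegasos paper; the seed is an assumption of this sketch.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0][0])
    for t in range(1, iterations + 1):
        x, y = samples[rng.randrange(len(samples))]  # random index i_t
        eta = 1.0 / (lam * t)                        # step size 1 / (lambda t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Sub-gradient of Eq. (10): lambda * w_t - 1[margin < 1] * y * x
        indicator = 1.0 if margin < 1.0 else 0.0
        w = [wi - eta * (lam * wi - indicator * y * xi) for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Sign of the linear score, with ties assigned to the positive class."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


# Hypothetical linearly separable toy data (label = sign of the first coordinate).
toy = [((2.0, 1.0), 1), ((1.5, -1.0), 1), ((3.0, 0.5), 1),
       ((-2.0, 1.0), -1), ((-1.5, -0.5), -1), ((-3.0, 0.0), -1)]
w = pegasos(toy, lam=0.1, iterations=2000)
training_accuracy = sum(predict(w, x) == y for x, y in toy) / len(toy)
```

Each iteration touches a single example, so the cost per update is linear in the number of features, which is what makes the method attractive for sliding windows.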


4 Results and Discussion 


We have evaluated the performance of the proposed methods by measuring the average
values of the Kappa Statistic and the Kappa Temporal Statistic. The results of all
experiments are shown in Tables 1, 2 and 3.

The learning evaluation used 10,000 instances and the validation of the results was
done by employing the Prequential Evaluation method [3]. The training window used
5,000 instances. Window-based approaches were allowed to store 5,000 samples (for
the sake of completeness, we also report the results of all window-based approaches
with a window size of 1,000 samples) but never more than 10% of the whole dataset.
This large amount gives the approaches a high degree of freedom and prevents the
concealment of their qualities by a too restricted window.
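
Prequential evaluation follows a test-then-train scheme, which the following sketch illustrates with a deliberately simple window-based learner. The class `WindowMajority`, the function `prequential`, the window size and the stream are hypothetical and unrelated to the classifiers evaluated in this paper.

```python
from collections import Counter, deque


class WindowMajority:
    """Toy window-based learner: predicts the majority label in a sliding window."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)

    def predict(self, x):
        if not self.window:
            return None  # nothing learned yet
        return Counter(self.window).most_common(1)[0][0]

    def learn(self, x, y):
        self.window.append(y)  # the oldest label is dropped automatically


def prequential(stream, model):
    """Test-then-train: every instance is first used for testing, then for training."""
    predictions, truths = [], []
    for x, y in stream:
        predictions.append(model.predict(x))
        truths.append(y)
        model.learn(x, y)
    return predictions, truths


# Hypothetical stream whose label changes from 0 to 1 halfway through.
stream = [(None, 0)] * 6 + [(None, 1)] * 6
predictions, truths = prequential(stream, WindowMajority(window_size=3))
accuracy = sum(p == t for p, t in zip(predictions, truths)) / len(truths)
```

Because every prediction is made before the corresponding label is seen, the errors right after the change in the stream show how quickly a window of a given size adapts.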


Table 1. Results for the water_tower_dataset (network traffic analysis, performance metrics)


Classifier         | Kappa statistic (window 5000) | Kappa temporal statistic (window 5000) | Kappa statistic (window 1000) | Kappa temporal statistic (window 1000)
k-NN SAM           | 74.56%                        | 75.29%                                 | 79.22%                        | 79.96%
SPegasos           | 72.07%                        | 72.94%                                 | 74.65%                        | 76.51%
ARF                | 71.86%                        | 72.47%                                 | 75.24%                        | 77.72%
Ensemble averaging | 73.52%                        | 74.26%                                 | 77.51%                        | 78.82%


Table 2. Results for the gas_dataset (network traffic analysis, performance metrics)


Classifier         | Kappa statistic (window 5000) | Kappa temporal statistic (window 5000) | Kappa statistic (window 1000) | Kappa temporal statistic (window 1000)
k-NN SAM           | 72.03%                        | 72.73%                                 | 74.18%                        | 75.41%
SPegasos           | 72.02%                        | 72.69%                                 | 73.94%                        | 75.01%
ARF                | 71.83%                        | 72.41%                                 | 73.87%                        | 74.89%
Ensemble averaging | 71.99%                        | 72.66%                                 | 74.07%                        | 75.23%


Table 3. Results for the electric_dataset (network traffic analysis, performance metrics)


Classifier         | Kappa statistic (window 5000) | Kappa temporal statistic (window 5000) | Kappa statistic (window 1000) | Kappa temporal statistic (window 1000)
k-NN SAM           | 75.72%                        | 76.33%                                 | 78.93%                        | 79.56%
SPegasos           | 75.63%                        | 76.12%                                 | 77.95%                        | 78.93%
ARF                | 74.47%                        | 75.16%                                 | 76.18%                        | 77.97%
Ensemble averaging | 75.45%                        | 76.05%                                 | 78.19%                        | 79.17%


The assessment of the actual error of the data flow classifiers is done in terms of the
Kappa statistic and the Kappa-Temporal statistic. The “true” label is either presented
right after the instance has been used for testing, or there is a delay between
the time an instance is presented and the moment in which its “true” label becomes
available [30]. The use of the dynamically estimated weighted average is the optimal
approach, considering that it solves a real problem of information systems security,
where it is rare for all data flows to have the same importance. The accuracy of the
algorithm which has achieved the highest accuracy for each data stream is multiplied by the
corresponding weighting factor of 0.6, reflecting its transient superiority and hence the
relative importance given by the model to the particular algorithm at that time.

Based on this technique, the model is led to a relatively smooth but high learning
rate, which determines how quickly learning converges. A high learning rate can
lead to faster convergence and to oscillation around the optimal weight values, while a low
learning rate results in slower convergence and can lead to trapping in local
extrema. The high learning rate is confirmed by the high accuracy rates of the model,
since very small data flows are considered compared to the evaluation of a batch
data set. According to this technique, the quality of the model’s adaptation is
interpreted as a “better forecasting” rate, due to the increased percentage of classification
precision. More specifically, the temporal bias created in the dynamics of a model at a
specific time is reflected in the high precision percentages of Table 1.

An additional important interpretation, resulting from the high accuracy of the
learning algorithms and the mild “mutation” attributable to the dynamically determined
weighted average, is that it assists in discovering the local extrema that may be
included in a data flow or in a learning window. This is expected, since new areas of
the multidimensional solution space are examined.

On the contrary, if the “mutation” rate were too high, it could lead to a reduction in
the exploitation of highly suitable areas of the solution space, and it could trap the
system in solutions that do not generalize [30, 33]. An important comment also refers
to the Kappa coefficient, which relates the level of observed agreement to the level of
random agreement. It accounts for the variation between observers (raters), as well as for
the variation that occurs when the same observer evaluates differently in repeated
evaluations of the same sample. The maximum value of the Kappa index, 1, represents full
agreement between observers, while the value 0 means that there is only
random agreement and thus no reliability between observers.
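
Both statistics can be computed directly from the sequences of predicted and true labels. The sketch below uses the standard definition of Cohen's Kappa and, for the Kappa Temporal statistic, replaces the chance agreement with the accuracy of a classifier that always repeats the previous true label. The function names and the label sequences are hypothetical, the first instance is skipped in the temporal variant because no previous label exists (a choice made for this illustration), and degenerate cases where the denominator is zero are not handled.

```python
from collections import Counter


def kappa(predictions, truths):
    """Cohen's kappa: (p0 - pc) / (1 - pc), with pc the chance agreement."""
    n = len(truths)
    p0 = sum(p == t for p, t in zip(predictions, truths)) / n
    pred_counts, true_counts = Counter(predictions), Counter(truths)
    pc = sum(pred_counts[c] * true_counts[c] for c in true_counts) / (n * n)
    return (p0 - pc) / (1 - pc)


def kappa_temporal(predictions, truths):
    """Kappa temporal: compares against a classifier that repeats the previous label."""
    pairs = list(zip(predictions[1:], truths[1:]))
    p0 = sum(p == t for p, t in pairs) / len(pairs)
    p_persist = sum(a == b for a, b in zip(truths[:-1], truths[1:])) / len(pairs)
    return (p0 - p_persist) / (1 - p_persist)


# Hypothetical labels (1 = outlier, 0 = normal), not taken from the datasets.
truths = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]
predictions = [0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
k = kappa(predictions, truths)
kt = kappa_temporal(predictions, truths)
```

A value close to 0 for either statistic indicates that the classifier does no better than its respective baseline, even if its raw accuracy looks high.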

As we can see, there is considerable reliability in all cases tested, which also
strengthens the overall reliability and usability of the proposed model. Similarly, by
attempting a comparison of the results between the algorithms, we see that the ARF
method generally needs a larger number of cases to yield new data. In addition, ARF
works by combining some loose linear boundaries on the decision surface, as opposed
to SPegasos, which can achieve max margin in non-linear boundaries. Therefore, given
that sliding windows are characterized by a small amount of data, SPegasos yielded
higher success rates than ARF. Regarding the comparison between SPegasos and k-NN
SAM, a clear reason that k-NN SAM performed better is that the particular
problem is located in a high-dimensional space where this algorithm is more efficient.
Also, the optimal combination of the two levels of memory, the different retention
intervals between the memories and the transfer of knowledge, has been shown to
minimize errors and to increase classification accuracy.


5 Conclusions 


An innovative, reliable and highly effective cyberattack detection system, based on
sophisticated computational intelligence, was presented in this paper. DELDaStrA
is an innovative effort to analyze large-scale, reliable and accurate data flows in order to
detect cyberattacks in critical infrastructure networks. The implementation of
DELDaStrA was based on the philosophy of the dynamic ensemble learning method, which
ensures the adaptation of the system to new situations, offering impartiality and
generalization. It is a robust framework capable of responding to high-complexity
problems. The performance of the proposed system has been tested by using three
multidimensional datasets of high complexity. These datasets were obtained after
extensive research into the operation of ICS (SCADA, DCS, PLC). They realistically
represent the operating states of these devices under normal conditions and under
cyberattack. The very high precision of the results reinforces the
general methodology followed. Proposals for the development and future
improvement of this system should focus on further optimizing the algorithms used, to
achieve an even more efficient, accurate and faster classification process. Also, new
approaches for further optimization should be considered, by employing
self-improvement and adaptive learning, which will fully automate the cyber-detection
process.

Abstract. The conventional Gaussian kernel-based fuzzy c-means clustering
algorithm has widely demonstrated its superiority to the conventional
fuzzy c-means when the data sets are arbitrarily shaped, and not
linearly separable. However, its performance is very dependent on the 
estimation of the bandwidth parameter of the Gaussian kernel function. 
Usually this parameter is estimated once and for all. This paper presents 
a Gaussian fuzzy c-means with kernelization of the metric which depends 
on a vector of bandwidth parameters, one for each variable, that are 
computed automatically. Experiments with data sets of the UCI machine 
learning repository corroborate the usefulness of the proposed algorithm. 


1 Introduction 


Clustering means the task of organizing a set of items into clusters such that 
items within a given cluster have a high degree of similarity, while items belong- 
ing to different clusters have a high degree of dissimilarity. Clustering has been 
successfully used in different fields, including bioinformatics, image processing, 
and information retrieval [14,21]. 

Hierarchy and Partition are the most popular cluster structures provided 
by clustering methods. Hierarchical methods yield a complete hierarchy, i.e., a 
nested sequence of partitions of the input data, whereas partitioning methods 
aim to obtain a single partition of the data into a fixed number of clusters,
usually based on an iterative algorithm that optimizes an objective function. 

Partitioning methods can be divided into crisp and fuzzy. Crisp clustering 
provides a crisp partition in which each object of the dataset belongs to one and 
only one cluster. Fuzzy clustering [1] generates a fuzzy partition that provides 
a membership degree for each object in a given cluster. This makes it possible to
identify objects that belong to more than one cluster at the same time [15].

Fuzzy c-means partitioning algorithms often use the Euclidean distance to 
compute the dissimilarity between the objects and the cluster representatives.
However, when the data structure is complex (i.e., clusters with non-hyperspherical
shapes and/or linearly non-separable patterns), the conventional fuzzy
c-means will not be able to provide effective results. Kernel-based clustering 
algorithms have been proposed to tackle these limitations [3,6,8,9]. 

There are two major variations of kernel-based clustering: one is the kernel- 
ization of the metric, where the cluster centroids are obtained in the original 
space and the distances between objects and cluster centroids are computed by 
means of kernels, while the other is the clustering in feature space, in which the 
cluster representatives are not in the original space and can only be obtained 
indirectly in the feature space [3,9]. 

In kernel-based clustering algorithms it is possible to compute Euclidean 
distances by using kernel functions and the so-called distance kernel trick [9]. 
This trick uses a kernel function to calculate the dot products of vectors implicitly 
in the higher dimensional space using the original space. 

The most popular kernel function in applications is the Gaussian kernel. In 
general, this kernel function provides effective results and requires the tuning 
of a single parameter, that is, the bandwidth parameter [4]. This parameter 
is tuned once and for all, and it is the same for all variables. Thus, implicitly 
the conventional Gaussian kernel fuzzy c-means assumes that the variables are 
equally rescaled and, therefore, they have the same importance to the clustering 
task. However, it is well known that some variables have different degrees of 
relevance while others are irrelevant to the clustering task [7,11,17, 20]. 

Recently, Ref. [5] proposed a Gaussian kernel c-means crisp clustering algo- 
rithm with kernelization of the metric, where each variable has its own hyper- 
parameter that is iteratively computed during the running of the algorithm. 

The main contribution of this paper is to provide a Gaussian kernel c-means
fuzzy clustering algorithm, with both kernelization of the metric and automated
computation of the bandwidth parameters using an adaptive Gaussian kernel. In
this kernel-based fuzzy clustering algorithm, the bandwidth parameters change
at each algorithm iteration and differ from variable to variable. Thus, the
algorithm is able to rescale the variables differently and thereby select the relevant
ones for the clustering task.

The paper is organized as follows. Section 2 first recalls the conventional 
kernel c-means fuzzy clustering algorithm with kernelization of the metric. It then
presents the Gaussian c-Means fuzzy clustering algorithm with kernelization of
the metric and with automatic computation of bandwidth parameters. In Sect. 3, 
experiments with data sets of the UCI machine learning repository corroborate 
the usefulness of the proposed algorithm. Section 4 provides the final remarks of 
the paper. 


2 Kernel Fuzzy c-Means with Kernelization 
of the Metric 


This section briefly recalls the basic concepts about kernel functions and the
conventional kernel c-means algorithm with kernelization of the metric. Let
$E = \{e_1, \ldots, e_n\}$ be a set of n objects described by p real-valued variables. Let
$D = \{x_1, \ldots, x_n\}$ be a non-empty set where, for $k = 1, \ldots, n$, the k-th object $e_k$ is
represented by a vector $x_k = (x_{k1}, \ldots, x_{kp}) \in \mathbb{R}^p$. A function
$K: D \times D \to \mathbb{R}$ is called a positive definite kernel (or Mercer kernel) if, and
only if, K is symmetric (i.e., $K(x_k, x_l) = K(x_l, x_k)$) and the following inequality holds [18]:

$$\sum_{l=1}^{n} \sum_{k=1}^{n} c_l c_k K(x_l, x_k) \ge 0, \quad \forall n \ge 2 \qquad (1)$$

where $c_l, c_k \in \mathbb{R}$ $(1 \le l, k \le n)$.

Let $\Phi: D \to F$ be a nonlinear mapping from the input space D to a
high-dimensional feature space F. By applying the mapping $\Phi$, the inner product
$x_l^T x_k$ in the input space is mapped to $\Phi(x_l)^T \Phi(x_k)$ in the feature space. The
basic notion in the kernel approaches is that the non-linear mapping $\Phi$ does not
need to be explicitly specified because each Mercer kernel can be expressed as
$K(x_l, x_k) = \Phi(x_l)^T \Phi(x_k)$ [18].

One of the most relevant implications is that it is possible to compute Euclidean
distances in F without knowing $\Phi$ explicitly, by using the so-called distance
kernel trick [9]:

$$\begin{aligned}
\|\Phi(x_l) - \Phi(x_k)\|^2 &= (\Phi(x_l) - \Phi(x_k))^T (\Phi(x_l) - \Phi(x_k)) \\
&= \Phi(x_l)^T \Phi(x_l) - 2\,\Phi(x_l)^T \Phi(x_k) + \Phi(x_k)^T \Phi(x_k) \\
&= K(x_l, x_l) - 2K(x_l, x_k) + K(x_k, x_k).
\end{aligned}$$
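
The identity above can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses a homogeneous quadratic kernel on two-dimensional inputs, chosen because its feature map can be written explicitly, and a Gaussian kernel with a hypothetical bandwidth argument `sigma2`. None of the function names come from the paper.

```python
import math


def poly2_kernel(x, y):
    """Homogeneous quadratic kernel K(x, y) = (x . y)^2 for 2-D inputs."""
    return (x[0] * y[0] + x[1] * y[1]) ** 2


def poly2_feature_map(x):
    """Explicit map Phi with Phi(x) . Phi(y) = (x . y)^2 for 2-D inputs."""
    return (x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2)


def gaussian_kernel(x, y, sigma2=1.0):
    """Gaussian kernel exp(-||x - y||^2 / (2 * sigma2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma2))


def kernel_sq_distance(kernel, x, y):
    """Squared feature-space distance computed from kernel values only."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)


def explicit_sq_distance(x, y):
    """Same distance, computed by mapping both points with Phi first."""
    px, py = poly2_feature_map(x), poly2_feature_map(y)
    return sum((a - b) ** 2 for a, b in zip(px, py))


x, y = (1.0, 2.0), (3.0, -1.0)
via_kernel = kernel_sq_distance(poly2_kernel, x, y)
via_map = explicit_sq_distance(x, y)
# For the Gaussian kernel K(x, x) = 1, so the distance reduces to 2 - 2 K(x, y).
gauss_dist = kernel_sq_distance(gaussian_kernel, x, y)
```

Both routes give the same squared distance, while the kernel route never constructs the feature vectors.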


2.1 Kernel Fuzzy c-Means with Kernelization of the Metric 


The kernel fuzzy c-means with kernelization of the metric (hereafter named
KFCM-K) provides a fuzzy partition of E into c clusters, represented by a matrix
of membership degrees $U = (u_{ki})$ $(1 \le k \le n; 1 \le i \le c)$, and a matrix of cluster
representatives (called hereafter matrix of prototypes) $G = (g_1, \ldots, g_c)$ of the
fuzzy clusters in the fuzzy partition U. The prototype of cluster i $(i = 1, \ldots, c)$
is represented by the vector $g_i = (g_{i1}, \ldots, g_{ip}) \in \mathbb{R}^p$.

From an initial solution, the matrix of prototypes G and the fuzzy partition
U are obtained iteratively in two steps (representation and assignment) by the
minimization of a suitable objective function, denoted below as $J_{\text{KFCM-K}}$,
that gives the total heterogeneity of the fuzzy partition computed as the sum of
the heterogeneity in each fuzzy cluster:

$$J_{\text{KFCM-K}}(G, U) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m \, \|\Phi(x_k) - \Phi(g_i)\|^2 \qquad (2)$$

where $1 < m < \infty$ is the fuzziness parameter. Using the so-called distance kernel
trick [9], we have $\|\Phi(x_k) - \Phi(g_i)\|^2 = K(x_k, x_k) - 2K(x_k, g_i) + K(g_i, g_i)$.
Hereafter we consider the Gaussian kernel, the most commonly used in the literature:

$$K(x_l, x_k) = \exp\left\{ -\frac{\|x_l - x_k\|^2}{2\sigma^2} \right\} = \exp\left\{ -\frac{1}{2} \sum_{j=1}^{p} \frac{1}{\sigma^2} (x_{lj} - x_{kj})^2 \right\},$$

where $\sigma^2$ is the bandwidth parameter of the Gaussian kernel.
Then, $K(x_k, x_k) = 1, \forall k$, $K(g_i, g_i) = 1, \forall i$, and
$\|\Phi(x_k) - \Phi(g_i)\|^2 = 2 - 2K(x_k, g_i)$, and thus the objective function
$J_{\text{KFCM-K}}$ becomes:

$$J_{\text{KFCM-K}}(G, U) = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m \, (1 - K(x_k, g_i)) \qquad (3)$$

During the representation step, the fuzzy partition U is kept fixed. The
objective function $J_{\text{KFCM-K}}$ is optimized with respect to the prototypes. Thus,
from $\partial J_{\text{KFCM-K}} / \partial g_i = 0$ and after some algebra, the fuzzy cluster prototypes are
obtained as follows:

$$g_i = \frac{\sum_{k=1}^{n} (u_{ki})^m K(x_k, g_i)\, x_k}{\sum_{k=1}^{n} (u_{ki})^m K(x_k, g_i)} \quad (1 \le i \le c). \qquad (4)$$

In the assignment step, the cluster prototypes are kept fixed. The components
$u_{ki}$ $(1 \le k \le n; 1 \le i \le c)$ of the matrix of membership degrees U that
minimize the clustering criterion given in Eq. (3) are computed as follows:

$$u_{ki} = \left[ \sum_{h=1}^{c} \left( \frac{1 - K(x_k, g_i)}{1 - K(x_k, g_h)} \right)^{\frac{1}{m-1}} \right]^{-1}. \qquad (5)$$

2.2 KFCM-K with Automatic Computation of Bandwidth 
Parameters 


The kernel fuzzy c-means with kernelization of the metric and automatic
computation of bandwidth parameters (hereafter named KFCM-K-H) provides a
partition of E into c clusters, represented by a matrix of membership degrees
$U = (u_{ki})$ $(1 \le k \le n; 1 \le i \le c)$, a vector of bandwidth parameters (one for
each variable) $s = (s_1^2, \ldots, s_p^2)$ and a matrix of prototypes $G = (g_1, \ldots, g_c)$ of
the fuzzy clusters in the fuzzy partition U.

From an initial solution, the matrix of prototypes G, the vector of bandwidth
parameters s and the fuzzy partition U are obtained iteratively in three
steps (representation, computation of the bandwidth parameters and assignment)
by the minimization of a suitable objective function, denoted below
as $J_{\text{KFCM-K-H}}$, that gives the total heterogeneity of the fuzzy partition
computed as the sum of the heterogeneity in each fuzzy cluster:

$$J_{\text{KFCM-K-H}}(G, s, U) = \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m \, \|\Phi(x_k) - \Phi(g_i)\|^2 \qquad (6)$$

where

$$\|\Phi(x_k) - \Phi(g_i)\|^2 = K^{(s)}(x_k, x_k) - 2K^{(s)}(x_k, g_i) + K^{(s)}(g_i, g_i) \qquad (7)$$

with

$$K^{(s)}(x_l, x_k) = \exp\left\{ -\frac{1}{2} \sum_{j=1}^{p} \frac{1}{s_j^2} (x_{lj} - x_{kj})^2 \right\}.$$

Because $K^{(s)}(x_k, x_k) = 1, \forall k$, $K^{(s)}(g_i, g_i) = 1, \forall i$, and
$\|\Phi(x_k) - \Phi(g_i)\|^2 = 2 - 2K^{(s)}(x_k, g_i)$, the objective function $J_{\text{KFCM-K-H}}$ becomes:

$$J_{\text{KFCM-K-H}}(G, s, U) = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m \, (1 - K^{(s)}(x_k, g_i)) \qquad (8)$$

During the representation step, the vector of bandwidth parameters s and
the fuzzy partition U are kept fixed. The objective function $J_{\text{KFCM-K-H}}$ is
optimized with respect to the prototypes. Thus, from $\partial J_{\text{KFCM-K-H}} / \partial g_i = 0$ and
after some algebra, the cluster prototypes are obtained as follows:

$$g_i = \frac{\sum_{k=1}^{n} (u_{ki})^m K^{(s)}(x_k, g_i)\, x_k}{\sum_{k=1}^{n} (u_{ki})^m K^{(s)}(x_k, g_i)} \quad (1 \le i \le c). \qquad (9)$$

In the computation of the bandwidth parameters step, the matrix of prototypes
G and the fuzzy partition U are kept fixed. First, we use the method
of Lagrange multipliers with the restriction that $\prod_{j=1}^{p} \frac{1}{s_j^2} = \gamma$, where $\gamma$ is a
suitable parameter, and obtain

$$L_{\text{KFCM-K-H}}(G, s, U) = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m (1 - K^{(s)}(x_k, g_i)) - \omega \left( \prod_{j=1}^{p} \frac{1}{s_j^2} - \gamma \right) \qquad (10)$$

Then, we compute the partial derivatives of $L_{\text{KFCM-K-H}}$ w.r.t. $\frac{1}{s_j^2}$ and $\omega$,
and by setting the partial derivatives to zero, and after some algebra we obtain

$$\frac{1}{s_j^2} = \frac{\gamma^{\frac{1}{p}} \left\{ \prod_{h=1}^{p} \left[ \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m K^{(s)}(x_k, g_i) (x_{kh} - g_{ih})^2 \right] \right\}^{\frac{1}{p}}}{\sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m K^{(s)}(x_k, g_i) (x_{kj} - g_{ij})^2} \quad (1 \le j \le p). \qquad (11)$$
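
The bandwidth update of Eq. (11) can be sketched as follows. This is a hedged illustration only: the function names and the toy data are assumptions, the kernel values on the right-hand side are evaluated with the current bandwidths (a fixed-point reading of the update), and the geometric mean is computed in log space for numerical stability.

```python
import math


def adaptive_kernel(x, g, inv_s2):
    """K^(s)(x, g) = exp(-0.5 * sum_j inv_s2[j] * (x_j - g_j)^2)."""
    return math.exp(-0.5 * sum(w * (a - b) ** 2 for w, a, b in zip(inv_s2, x, g)))


def update_inverse_bandwidths(X, G, U, m, inv_s2, gamma):
    """Eq. (11): new values of 1/s_j^2 whose product equals gamma."""
    p = len(inv_s2)
    # A[j] = sum_i sum_k u_ki^m K^(s)(x_k, g_i) (x_kj - g_ij)^2
    A = [0.0] * p
    for k, x in enumerate(X):
        for i, g in enumerate(G):
            w = (U[k][i] ** m) * adaptive_kernel(x, g, inv_s2)
            for j in range(p):
                A[j] += w * (x[j] - g[j]) ** 2
    if min(A) <= 0.0:
        raise ValueError("a variable has zero weighted dispersion")
    # log of (gamma * prod_h A[h])^(1/p), the numerator of Eq. (11)
    log_num = (math.log(gamma) + sum(math.log(a) for a in A)) / p
    return [math.exp(log_num - math.log(a)) for a in A]


# Toy data (hypothetical): variable 0 is widely spread, variable 1 is not.
X = [[0.0, 0.0], [2.0, 0.1], [10.0, 0.0], [12.0, 0.1]]
G = [[1.0, 0.05], [11.0, 0.05]]
U = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
new_inv_s2 = update_inverse_bandwidths(X, G, U, m=2.0, inv_s2=[1.0, 1.0], gamma=1.0)
```

In this example the variable with the larger within-cluster dispersion receives the smaller value of 1/s_j^2, and the product constraint holds by construction.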


In the assignment step, the matrix of fuzzy cluster prototypes G and the
vector of bandwidth parameters s are kept fixed. First, we use the method of
Lagrange multipliers with the restriction that $\sum_{i=1}^{c} u_{ki} = 1$, and obtain

$$L_{\text{KFCM-K-H}}(G, s, U) = 2 \sum_{i=1}^{c} \sum_{k=1}^{n} (u_{ki})^m (1 - K^{(s)}(x_k, g_i)) - \sum_{k=1}^{n} \omega_k \left( \sum_{i=1}^{c} u_{ki} - 1 \right) \qquad (12)$$

Then, we compute the partial derivatives of $L_{\text{KFCM-K-H}}$ w.r.t. $u_{ki}$ and $\omega_k$,
and by setting the partial derivatives to zero, and after some algebra we obtain

$$u_{ki} = \left[ \sum_{h=1}^{c} \left( \frac{1 - K^{(s)}(x_k, g_i)}{1 - K^{(s)}(x_k, g_h)} \right)^{\frac{1}{m-1}} \right]^{-1}. \qquad (13)$$


Algorithm 1. KFCM-K and KFCM-K-H algorithms

 1: Input
 2:   $D = \{x_1, \ldots, x_n\}$ (the data set); c (the number of clusters); $\gamma > 0$ (a suitable
      parameter); T (maximum number of iterations); $\varepsilon$ (threshold parameter);
 3: Output
 4:   KFCM-K and KFCM-K-H: the matrix of prototypes $G = (g_1, \ldots, g_c)$;
 5:   KFCM-K-H: the vector of bandwidth parameters $s = (s_1^2, \ldots, s_p^2)$;
 6:   KFCM-K and KFCM-K-H: the matrix of membership degrees $U = (u_{ki})$
      $(1 \le k \le n; 1 \le i \le c)$.
 7: Initialization
 8:   t = 0;
 9:   KFCM-K and KFCM-K-H: randomly select c distinct prototypes $g_i^{(0)} \in D$ $(1 \le i \le c)$;
10:   KFCM-K-H: set $1/(s_j^2)^{(0)} = \gamma^{1/p}$ $(1 \le j \le p)$;
11:   KFCM-K: compute the components $u_{ki}^{(0)}$ $(1 \le k \le n; 1 \le i \le c)$ of the matrix
      of membership degrees $U^{(0)}$ according to Eq. (5);
12:   KFCM-K-H: compute the components $u_{ki}^{(0)}$ $(1 \le k \le n; 1 \le i \le c)$ of the matrix
      of membership degrees $U^{(0)}$ according to Eq. (13).
13:   KFCM-K: compute $J_{\text{KFCM-K}}(G^{(0)}, U^{(0)})$ according to Eq. (3);
14:   KFCM-K-H: compute $J_{\text{KFCM-K-H}}(G^{(0)}, s^{(0)}, U^{(0)})$ according to Eq. (8).
15: repeat
16:   t = t + 1;
17:   Step 1: representation.
18:   KFCM-K: compute the cluster representatives $g_1^{(t)}, \ldots, g_c^{(t)}$ using Eq. (4);
19:   KFCM-K-H: compute the cluster representatives $g_1^{(t)}, \ldots, g_c^{(t)}$ using Eq. (9).
20:   Step 2: computation of the vector of bandwidth parameters.
21:   KFCM-K: skip this step;
22:   KFCM-K-H: compute the vector of bandwidth parameters $s^{(t)}$ using Eq. (11);
23:   Step 3: assignment.
24:   KFCM-K: compute the components $u_{ki}^{(t)}$ $(1 \le k \le n; 1 \le i \le c)$ of the matrix
      of membership degrees $U^{(t)}$ according to Eq. (5).
25:   KFCM-K-H: compute the components $u_{ki}^{(t)}$ $(1 \le k \le n; 1 \le i \le c)$ of the
      matrix of membership degrees $U^{(t)}$ according to Eq. (13).
26:   KFCM-K: compute $J_{\text{KFCM-K}}(G^{(t)}, U^{(t)})$ according to Eq. (3).
27:   KFCM-K-H: compute $J_{\text{KFCM-K-H}}(G^{(t)}, s^{(t)}, U^{(t)})$ according to Eq. (8).
28: until
29:   KFCM-K: $|J_{\text{KFCM-K}}(G^{(t)}, U^{(t)}) - J_{\text{KFCM-K}}(G^{(t-1)}, U^{(t-1)})| < \varepsilon$ or $t > T$;
30:   KFCM-K-H: $|J_{\text{KFCM-K-H}}(G^{(t)}, s^{(t)}, U^{(t)}) - J_{\text{KFCM-K-H}}(G^{(t-1)}, s^{(t-1)}, U^{(t-1)})| < \varepsilon$
      or $t > T$.


2.3 The Algorithms 


The two steps of KFCM-K and the three steps of KFCM-K-H are repeated until
convergence. Algorithm 1 summarizes these steps.


3 Empirical Results 


This section discusses the performance and the usefulness of the proposed
algorithm in comparison with the standard KFCM-K and FCM [1] algorithms.
Twelve datasets from the UCI Machine Learning Repository [2], namely,
Breast tissue, Ecoli, Image segmentation, Iris plants, Leaf, Libras Movement,
Multiple features, Seeds, Thyroid gland, Urban land cover, Breast cancer
wisconsin (diagnostic), and Wine, with different numbers of objects, variables and a
priori classes, were considered in this study. Table 1 (in which n is the number
of objects, p is the number of real-valued variables and K is the number of a
priori classes) summarizes these data sets.


Table 1. Summary of the data sets 


Data sets            |    n |   p |  K | Data sets               |    n |   p |  K
Breast tissue        |  106 |   9 |  6 | Multiple features       | 2000 | 649 | 10
Ecoli                |  336 |     |  8 | Seeds                   |  210 |   7 |  3
Image segmentation   | 2100 |  19 |  7 | Thyroid gland           |  215 |   5 |  3
Iris                 |  150 |   4 |  3 | Urban land cover        |  675 | 148 |  9
Leaf                 |  310 |  14 | 36 | Breast cancer wisconsin |  569 |  30 |  2
Libras Movement      |  360 |  90 | 15 | Wine                    |  178 |  13 |  3


FCM, KFCM-K and KFCM-K-H were run on these data sets 100 times,
with c (the number of clusters) equal to K (the number of a priori classes). The
parameter $\gamma$ of the KFCM-K-H algorithm was set as $\gamma = (\sigma^2)^p$, where $\sigma^2$ is
the optimal width hyper-parameter used in the conventional KFCM-K algorithm
that is estimated as the average of the 0.1 and 0.9 quantiles of $\|x_l - x_k\|^2$, $l \ne k$
[4]. The fuzziness parameter was set as m = 1.6 and m = 2.0.
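
The quantile-based estimate of the width hyper-parameter described above might be computed as in the following sketch. It is an illustration under assumptions: the function names are hypothetical, each unordered pair of objects is counted once, and a linearly interpolated quantile is used, which is only one of several common quantile conventions and may differ from the one used in [4].

```python
def squared_distances(X):
    """Pairwise squared Euclidean distances ||x_l - x_k||^2 for l < k."""
    return [sum((a - b) ** 2 for a, b in zip(X[l], X[k]))
            for l in range(len(X)) for k in range(l + 1, len(X))]


def quantile(sorted_values, q):
    """Linearly interpolated quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac


def estimate_sigma2(X):
    """Average of the 0.1 and 0.9 quantiles of the pairwise squared distances."""
    if len(X) < 2:
        raise ValueError("at least two objects are required")
    d = sorted(squared_distances(X))
    return 0.5 * (quantile(d, 0.1) + quantile(d, 0.9))


# Toy data (hypothetical): four equally spaced one-dimensional objects.
sigma2 = estimate_sigma2([[0.0], [1.0], [2.0], [3.0]])
```

For these four points the sorted squared distances are 1, 1, 1, 4, 4, 9, so the two quantiles are 1 and 6.5 under this convention.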

To compare the quality of the fuzzy partitions provided by these algorithms,
the Rand index for a fuzzy partition (Rand-F) [10] and the Hüllermeier index
(HUL) [13] were considered. The Rand-F and HUL indexes make it possible to compare the
a priori partition of a dataset with the fuzzy partitions provided by the algorithms.
They range between 0 and 1, where a value equal to one corresponds to total
agreement between the partitions.

Table 2 shows the best results (according to the respective objective
functions) of the FCM, KFCM-K and KFCM-K-H algorithms on the data sets of Table 1,
according to the Rand-F and HUL indexes and for the fuzziness parameter set
as m = 1.6 and m = 2.0.

It can be observed that, whatever the index considered (Rand-F or HUL),
the FCM (for the great majority of the datasets), KFCM-K (for the great majority
of the datasets) and KFCM-K-H (for all of the datasets) algorithms
performed better with m = 1.6. Moreover, whatever the index and fuzziness
parameter considered, the KFCM-K-H algorithm performed better
than the KFCM-K algorithm on the majority of the data sets of Table 1.
Besides, whatever the index and fuzziness parameter considered, the standard
FCM algorithm outperformed both the KFCM-K and KFCM-K-H algorithms
on the majority of the datasets of Table 1. This is not unexpected because,
as pointed out by Ref. [19], kernelization may impose undesirable structures on
the data, and hence, the clusters obtained in the kernel space may not exhibit
the structure of the original data.


Table 2. Performance of the algorithms: fuzzy partition

                        | Rand-F                                              || HUL
                        | FCM             | KFCM-K          | KFCM-K-H        || FCM             | KFCM-K          | KFCM-K-H
Data sets               | m=1.6  | m=2.0  | m=1.6  | m=2.0  | m=1.6  | m=2.0  || m=1.6  | m=2.0  | m=1.6  | m=2.0  | m=1.6  | m=2.0
Breast tissue           | 0.6236 | 0.6296 | 0.7153 | 0.7280 | 0.7268 | 0.7095 || 0.6205 | 0.6143 | 0.6937 | 0.6183 | 0.7163 | 0.6505
Ecoli                   | 0.7624 | 0.7239 | 0.7246 | 0.6969 | 0.7653 | 0.7211 || 0.7445 | 0.6470 | 0.6364 | 0.5263 | 0.7524 | 0.6448
Image segmentation      | 0.7796 | 0.7787 | 0.7571 | 0.7588 | 0.8671 | 0.8113 || 0.7171 | 0.5974 | 0.4155 | 0.3111 | 0.8404 | 0.7049
Iris                    | 0.8620 | 0.8131 | 0.7723 | 0.6881 | 0.7999 | 0.7037 || 0.8641 | 0.8187 | 0.7661 | 0.6632 | 0.8004 | 0.6880
Leaf                    | 0.9528 | 0.9450 | 0.9499 | 0.9452 | 0.9551 | 0.9462 || 0.9035 | 0.7416 | 0.8223 | 0.5798 | 0.8772 | 0.6414
Libras Movement         | 0.8887 | 0.8793 | 0.8813 | 0.8789 | 0.8815 | 0.8788 || 0.6543 | 0.2603 | 0.3998 | 0.2656 | 0.4039 | 0.2599
Multiple features       | 0.8689 | 0.8500 | 0.8284 | 0.8227 | 0.8291 | 0.8226 || 0.8336 | 0.7141 | 0.3915 | 0.2707 | 0.3967 |
Seeds                   | 0.8287 | 0.7608 | 0.6047 | 0.5798 | 0.7515 | 0.6675 || 0.8268 | 0.7543 | 0.5133 | 0.4547 | 0.7412 |
Thyroid gland           | 0.7195 | 0.6070 | 0.4941 | 0.4913 | 0.5731 | 0.4985 || 0.7213 | 0.6175 | 0.4634 | 0.4647 | 0.5856 | 0.
Urban land cover        | 0.7320 | 0.7477 | 0.7836 | 0.7830 | 0.8039 | 0.7852 || 0.6371 | 0.5184 | 0.2960 | 0.2085 | 0.5442 | 0.2755
Breast cancer wisconsin | 0.7317 | 0.7151 | 0.5000 | 0.5000 | 0.7531 | 0.6452 || 0.7346 | 0.7234 | 0.5046 | 0.5095 | 0.7731 | 0.6978
Wine                    | 0.7015 | 0.6758 | 0.8198 | 0.6664 | 0.7683 | 0.6471 || 0.6987 | 0.6651 | 0.6934 | 0.6533 | 0.7792 | 0.6279

A crisp partition $Q = (Q_1, \ldots, Q_c)$ is obtained from the fuzzy partition U,
where the cluster $Q_i$ $(i = 1, \ldots, c)$ is defined as: $Q_i = \{e_k \in E : u_{ki} = \max_{1 \le h \le c} u_{kh}\}$.
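
The passage from a fuzzy partition to a crisp one can be sketched as below. The function names and the example matrix are hypothetical. As an additional assumption, an object with tied maximal memberships is assigned to the tied cluster of lowest index, a case the definition above leaves open.

```python
def harden(U):
    """For each object, index of the cluster of maximal membership (first on ties)."""
    return [max(range(len(row)), key=row.__getitem__) for row in U]


def clusters_from_labels(labels, c):
    """Crisp partition Q = (Q_1, ..., Q_c) as lists of object indices."""
    Q = [[] for _ in range(c)]
    for k, i in enumerate(labels):
        Q[i].append(k)
    return Q


# Hypothetical membership matrix with n = 3 objects and c = 2 clusters.
U = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5]]
labels = harden(U)
Q = clusters_from_labels(labels, c=2)
```

The resulting label vector is the input expected by crisp agreement indexes.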


To compare the quality of the crisp partitions provided by the KFCM-K and
KFCM-K-H algorithms, the adjusted Rand index (ARI) [12] and the normalized mutual
information (NMI) [16] were considered. The ARI and NMI indexes make it possible to compare
the a priori partition of a dataset with the crisp partitions obtained from the fuzzy
partitions provided by the algorithms. The ARI takes its values on the interval
[-1,1], in which the value 1 indicates perfect agreement between partitions. The
NMI takes its values on the interval [0,1], in which the value 1 also indicates
perfect agreement between partitions.
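
For reference, the adjusted Rand index between an a priori partition and a crisp partition can be computed from their contingency table, as in the following sketch. The function name and the example labels are hypothetical. The formula is the usual pair-counting form of the ARI, which is assumed here to match the definition in [12].

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index from the contingency table of two labelings."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two label sequences of equal length >= 2")
    # Pairs of objects placed together in both partitions.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together in each partition taken separately.
    sum_rows = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_cols = sum(comb(v, 2) for v in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical labels: the same partition with swapped cluster names.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every a priori class in half.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index does not depend on how clusters are numbered, which is why the relabelled partition still scores 1.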

Table 3 shows the best results (according to the respective objective
functions) of the FCM, KFCM-K and KFCM-K-H algorithms on the data sets of Table 1,
according to the ARI and NMI indexes and for the fuzziness parameter set as
m = 1.6 and m = 2.0.


Table 3. Performance of the algorithms: crisp partition 


                        | ARI                                                 || NMI
                        | FCM             | KFCM-K          | KFCM-K-H        || FCM             | KFCM-K          | KFCM-K-H
Data sets               | m=1.6  | m=2.0  | m=1.6  | m=2.0  | m=1.6  | m=2.0  || m=1.6  | m=2.0  | m=1.6  | m=2.0  | m=1.6  | m=2.0
Breast tissue           | 0.1101 | 0.1252 | 0.2065 | 0.1143 | 0.2944 | 0.2934 || 0.3083 | 0.3218 | 0.3428 | 0.2766 | 0.5515 | 0.5356
Ecoli                   | 0.3880 | 0.3682 | 0.3230 | 0.3277 | 0.4182 | 0.4015 || 0.5721 | 0.5514 | 0.5232 | 0.5162 | 0.6075 | 0.5927
Image segmentation      | 0.3116 | 0.3045 | 0.2133 | 0.2832 | 0.5257 | 0.4429 || 0.4920 | 0.4670 | 0.2931 | 0.3885 | 0.6444 | 0.6234
Iris                    | 0.7163 | 0.7294 | 0.8015 | 0.7859 | 0.8856 | 0.9037 || 0.7419 | 0.7496 | 0.7899 | 0.7773 | 0.8641 | 0.8801
Leaf                    | 0.3145 | 0.2654 | 0.3096 | 0.2877 | 0.3566 | 0.3602 || 0.6652 | 0.6477 | 0.6738 | 0.6538 | 0.7061 | 0.6981
Libras Movement         | 0.3227 | 0.1667 | 0.2726 | 0.1515 | 0.2419 | 0.2070 || 0.5821 | 0.3767 | 0.5315 | 0.4399 | 0.5239 | 0.4970
Multiple features       | 0.4280 | 0.4206 | 0.4128 | 0.3615 | 0.5324 | 0.4073 || 0.5669 | 0.5612 | 0.6053 | 0.5693 | 0.6397 | 0.6303
Seeds                   | 0.7166 | 0.7166 | 0.7034 | 0.7034 | 0.6975 | 0.6954 || 0.6949 | 0.6949 | 0.6737 | 0.6737 | 0.6804 | 0.6716
Thyroid gland           | 0.5698 | 0.4413 | 0.0588 | 0.0495 | 0.1538 | 0.1819 || 0.4088 | 0.3434 | 0.1500 | 0.1353 | 0.2793 | 0.3240
Urban land cover        |        | 0.0373 | 0.0737 | 0.0491 | 0.4289 | 0.2626 || 0.1345 | 0.1272 | 0.1909 | 0.1383 | 0.5011 | 0.3640
Breast cancer wisconsin |        | 0.4914 | 0.0215 | 0.0176 | 0.7178 | 0.7182 || 0.4567 | 0.4647 | 0.0593 | 0.0561 | 0.6174 | 0.6045
Wine                    | 0.3602 | 0.3539 | 0.3711 | 0.3749 | 0.8332 | 0.8482 || 0.4212 | 0.4167 | 0.4287 | 0.4315 | 0.8199 | 0.8329

It can be observed that, also in this case, whatever the index considered
(ARI or NMI), for the majority of the datasets the FCM, KFCM-K and
KFCM-K-H algorithms performed better with m = 1.6. Moreover, whatever the
index and fuzziness parameter considered, the KFCM-K-H algorithm
outperformed both the KFCM-K and FCM algorithms on the majority of the data sets
of Table 1. Besides, for the ARI index and whatever the fuzziness parameter
considered, the standard FCM algorithm performed better than the KFCM-K
algorithm on the majority of the datasets of Table 1.


4 Final Remarks and Conclusions 


The clustering performance of the conventional KFCM-K, the Gaussian
kernel-based fuzzy clustering algorithm, is highly dependent on the estimation of the
bandwidth parameter of the Gaussian kernel function, which is estimated once and
for all. In this paper we proposed KFCM-K-H, a Gaussian fuzzy c-Means with
kernelization of the metric and automatic computation of a vector of bandwidth 
parameters, one for each variable. In the proposed kernel-based fuzzy clustering 
algorithm, the bandwidth parameters change at each iteration of the algorithm 
and are different from variable to variable. Thus, the proposed algorithm is able 
to select the important variables for the clustering task. 

Experiments with twelve data sets from the UCI machine learning repository,
with different numbers of objects, variables and a priori classes, demonstrated the
performance of the proposed algorithm. It was observed that, for the majority of
these data sets, the proposed KFCM-K-H algorithm provided crisp and fuzzy
partitions of better quality than those provided by the conventional KFCM-K
algorithm. Moreover, the KFCM-K-H algorithm provided crisp partitions of
better quality than those provided by the standard FCM algorithm. Besides, it was
observed that the FCM algorithm outperformed both the KFCM-K and KFCM-K-H
algorithms on the majority of these data sets, concerning the quality of the fuzzy
partitions. These latter findings support the remark provided by Ref. [19], i.e., that
the kernelization may impose undesirable structures on the data and the clusters
obtained in the kernel space may not exhibit the structure of the original data.


Abstract. Symbolic Data Analysis provides suitable new types of variables
that can take into account the variability present in the observed
measurements. This paper proposes a partitioning fuzzy clustering algorithm
for interval-valued data based on a suitable adaptive Euclidean distance
and entropy regularization. The proposed method optimizes an
objective function by alternating three steps that compute the
fuzzy cluster representatives, the fuzzy partition, and the relevance
weights for the interval-valued variables. Experiments on synthetic and
real datasets corroborate the usefulness of the proposed algorithm.


1 Introduction 


Clustering methods seek to organize a set of items into clusters such that objects
within a given group have a high degree of similarity, whereas elements belonging
to different clusters have a high degree of dissimilarity [15]. Partition and
hierarchy are the most popular cluster structures provided by clustering methods.
Hierarchical methods yield a complete hierarchy, i.e., a nested sequence of
partitions of the input data, whereas partitioning methods aim to obtain a single
partition of the data into a fixed number of clusters, usually based on an
iterative algorithm that optimizes an objective function.

Partitioning methods can be divided into hard and fuzzy clustering. Hard 
clustering methods restrict each point of the dataset to exactly one cluster. On 
the other hand, in fuzzy clustering, a pattern may belong to all clusters with a 
specific membership degree. Generally, in conventional clustering methods, all 
the variables participate with the same importance to the clustering process. 
However, in real situations, some variables could be more or less important or 
even irrelevant for this task. A better solution is to introduce the proper attribute 
weight into the clustering process [16]. 

Most clustering algorithms are defined to deal with data described by
single-valued variables, i.e., variables that take a single measurement or a category
for an object. However, there are many other kinds of information that cannot
be explained with single-valued variables. For example, to take into account
variability inherent to the data, variables must be multi-valued, assuming sets
of categories or intervals, possibly even with frequencies or weights. These kinds
of data have been mainly studied in Symbolic Data Analysis (SDA), a domain
related to multivariate analysis, pattern recognition and artificial intelligence.
The aim of SDA is to provide suitable methods for managing aggregated data
described by multi-valued variables [2].

Hard and fuzzy clustering methods are already available for managing
interval-valued data. For example, Ref. [17] introduced fuzzy clustering algorithms for
mixed features of symbolic and fuzzy data. In these fuzzy clustering algorithms,
the membership degree is associated with the values of the features in the
clusters for the cluster centers instead of being associated with the patterns in each
group, as is the usual case. De Carvalho [4] presented fuzzy C-means clustering
algorithms based on suitable Euclidean distances for interval-valued data.

This paper presents a new fuzzy C-means type algorithm based on adaptive
Euclidean distances with Entropy Regularization for interval-valued data,
where the adaptive distance takes into account lower and upper boundaries of
the data. The improvement in comparison with Ref. [4] concerns a new automatic
weighting scheme for the interval boundaries. The weights of the lower
and upper boundaries in Ref. [4] are managed independently. In that case, even
if a boundary plays a minor role compared with the others, the algorithm of Ref. [4]
may still assign it a substantial contribution. Following Ref. [14],
this paper proposes a solution to this side effect. The proposed fuzzy clustering
algorithm alternates three steps: allocation, weighting and representation
steps that compute the objects' memberships to the clusters, the weights for
each variable and/or each boundary and the clusters’ prototypes respectively,
until a stationary value of a homogeneity criterion is reached.

Section 2 presents the fuzzy clustering algorithm based on Adaptive
Euclidean distance and entropy regularization. Section 3 provides several
experiments with synthetic and real datasets that corroborate the usefulness of the
proposed algorithm. Finally, conclusions are drawn in Sect. 4. 


2 Fuzzy Clustering Algorithm Based on Adaptive 
Euclidean Distance and Entropy Regularization 
for Interval Data 


This section describes the proposed fuzzy clustering algorithm based on Adaptive
distance and Entropy Regularization for interval-valued data (hereafter referred
to as AIFCM-ER).

Let $E = \{e_1, \ldots, e_N\}$ be a set of N objects described by P interval-valued
variables [2]. An interval-valued variable is a mapping defined from
the dataset E into the set $\Im$ of closed intervals of $\mathbb{R}$. In other words, for any
$e \in E$ the value $y(e)$ is an interval of the form $[a, b]$ where a, b are real
numbers such that $a \le b$. The i-th object $e_i$ $(1 \le i \le N)$ is represented by a
vector $x_i = (x_{i1}, \ldots, x_{iP})$, where $x_{ij} = [a_{ij}, b_{ij}]$, with $a_{ij} \le b_{ij}$, is the interval
value taken by the j-th variable $(1 \le j \le P)$. Let $D = \{x_1, \ldots, x_N\}$ be the
interval-valued dataset. Each fuzzy cluster $P_k$ $(k = 1, \ldots, C)$ has a representative
element, called hereafter a prototype. As the examined variables are
interval-valued, each prototype $g_k = (g_{k1}, \ldots, g_{kP})$ is a vector of P intervals with
$g_{kj} = [\alpha_{kj}, \beta_{kj}]$ $(1 \le j \le P; 1 \le k \le C)$.

Conventional clustering models consider that all variables are equally important
to the clustering task. However, in most applications some variables may
be irrelevant and, among the relevant ones, some may be more or less relevant
than others. Furthermore, the relevance of each variable to each cluster may be
different, i.e., each cluster may have a different set of relevant variables [5,8]. In
previous works, the boundary weights of the interval data were assigned
independently. Therefore, it was not possible to compare the relevance of the lower
boundaries with that of the upper boundaries. Furthermore, even if the lower/upper
boundary is not (or only marginally) relevant for the clustering process, a set of weights
that are significantly greater than zero is always assigned. To overcome these
drawbacks, joint weighting of the lower and upper
boundaries is proposed [14].

In this paper, we denote by $V_l = (\mathbf{v}_{l,1}, \ldots, \mathbf{v}_{l,k}, \ldots, \mathbf{v}_{l,C})$
and $V_u = (\mathbf{v}_{u,1}, \ldots, \mathbf{v}_{u,k}, \ldots, \mathbf{v}_{u,C})$ the matrices of
positive weights for the lower and upper boundaries, respectively. The vectors
$\mathbf{v}_{l,k} = (v_{l,k1}, \ldots, v_{l,kP})$ and $\mathbf{v}_{u,k} = (v_{u,k1}, \ldots, v_{u,kP})$
are the $P$-dimensional vectors of relevance weights; each weight measures the
importance of an interval-valued variable in the $k$-th fuzzy cluster, for the lower
and upper boundaries respectively.

The proposed algorithm provides a fuzzy partition represented by the matrix
$U = (\mathbf{u}_1, \ldots, \mathbf{u}_N) = (u_{ik})_{1 \le i \le N,\; 1 \le k \le C}$, where
$u_{ik}$ is the membership degree of object $e_i$ in the fuzzy cluster $k$ and
$\mathbf{u}_i = (u_{i1}, \ldots, u_{iC})$; a matrix of prototypes
$G = (\mathbf{g}_1, \ldots, \mathbf{g}_C)$ that represents the fuzzy clusters in the fuzzy
partition; and the matrices of relevance weights of the variables $V_l$ and $V_u$.

The matrix of prototypes $G$, the matrices of positive weights $V_l$, $V_u$ and the
matrix of membership degrees $U$ are obtained iteratively by the minimization of
a suitable adequacy criterion, denoted below as $J_{\text{AIFCM-ER}}$, that gives the
total homogeneity of the fuzzy partition computed as the sum of the homogeneity
in each fuzzy cluster:

$$J_{\text{AIFCM-ER}} = \sum_{i=1}^{N} \sum_{k=1}^{C} (u_{ik})\, d_{(\mathbf{v}_{l,k}, \mathbf{v}_{u,k})}(\mathbf{x}_i, \mathbf{g}_k) + T_u \sum_{i=1}^{N} \sum_{k=1}^{C} (u_{ik}) \ln(u_{ik}) \qquad (1)$$

subject to: $\sum_{k=1}^{C} u_{ik} = 1$ and $\prod_{j=1}^{P} v_{l,kj}\, v_{u,kj} = 1$,

where

$$d_{(\mathbf{v}_{l,k}, \mathbf{v}_{u,k})}(\mathbf{x}_i, \mathbf{g}_k) = \sum_{j=1}^{P} \left[ v_{l,kj} (a_{ij} - \alpha_{kj})^2 + v_{u,kj} (b_{ij} - \beta_{kj})^2 \right] \qquad (2)$$

is a suitable adaptive dissimilarity between the vectors of intervals $\mathbf{x}_i$ and
$\mathbf{g}_k$, parameterized by the vectors of relevance weights of the variables
$\mathbf{v}_{l,k}$ and $\mathbf{v}_{u,k}$ on the fuzzy cluster $k$, for lower and upper
boundaries respectively. The second term is the negative entropy and is used to
control the membership degree $u_{ik}$. $T_u$ is a positive regularizing parameter.
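
The criterion can be evaluated directly once a data layout is fixed. The following is a minimal sketch, not the authors' implementation: it assumes Eq. (2) is the weighted sum of squared differences of the lower and upper bounds and that Eq. (1) adds $T_u$ times the negative entropy of the memberships. The function names and the list-of-tuples layout are illustrative choices.

```python
import math

def adaptive_dissimilarity(x, g, v_l, v_u):
    """Eq. (2) as assumed here: x and g are lists of (lower, upper) intervals,
    v_l and v_u are the per-variable weights of one cluster."""
    return sum(
        wl * (a - alpha) ** 2 + wu * (b - beta) ** 2
        for (a, b), (alpha, beta), wl, wu in zip(x, g, v_l, v_u)
    )

def objective(X, G, V_l, V_u, U, T_u):
    """Eq. (1) as assumed here: weighted dissimilarities plus T_u * sum(u * ln u)."""
    total = 0.0
    for i, x in enumerate(X):
        for k, g in enumerate(G):
            u = U[i][k]
            total += u * adaptive_dissimilarity(x, g, V_l[k], V_u[k])
            if u > 0.0:  # convention: 0 * ln(0) is treated as 0
                total += T_u * u * math.log(u)
    return total
```

With all weights equal to 1 the dissimilarity reduces to the squared Euclidean distance between the bound vectors, which is a convenient sanity check.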

In the literature, two main types of constraints are proposed: a product-to-one
constraint [5] and a sum-to-one constraint [10]. However, in this paper, we will not
consider the latter alternative because it depends on the setting of additional
parameters. Thus, this dissimilarity function is parameterized by the vectors of
relevance weights $\mathbf{v}_{l,k}$ and $\mathbf{v}_{u,k}$, in which $v_{l,kj} > 0$,
$v_{u,kj} > 0$ and $\prod_{j=1}^{P} v_{l,kj}\, v_{u,kj} = 1$, and it is associated with the
$k$-th fuzzy cluster ($k = 1, \ldots, C$). Note that the vectors of weights
$\mathbf{v}_{l,k} = (v_{l,k1}, \ldots, v_{l,kP})$ and $\mathbf{v}_{u,k} = (v_{u,k1}, \ldots, v_{u,kP})$
are estimated locally and change at each iteration, i.e., they are not determined
absolutely, and are different from one cluster to another. Moreover, note that the
relevant variables in the groups have weights that are greater than 1.


2.1 The Optimization Steps of the AIFCM-ER Algorithm 


This section provides the optimization algorithm for computing the prototypes, the
relevance weights of the variables and the fuzzy partition. For the AIFCM-ER
algorithm, the minimization of $J_{\text{AIFCM-ER}}$ (Eq. 1) is performed iteratively in
three steps (representation, weighting, and allocation).

The matrices $U$, $V_l$ and $V_u$ can be computed by applying the Lagrange multipliers
$\lambda_i$ and $\gamma_k$ to the constraints of Eq. 1:

$$\mathcal{L} = J_{\text{AIFCM-ER}} - \sum_{k=1}^{C} \gamma_k \left[ \prod_{j=1}^{P} v_{l,kj}\, v_{u,kj} - 1 \right] - \sum_{i=1}^{N} \lambda_i \left[ \sum_{k=1}^{C} u_{ik} - 1 \right] \qquad (3)$$



Representation Step: This step provides the solution for the optimal computation of
the prototype associated with each cluster. During this step, the matrix of
membership degrees $U$ and the matrices of positive weights $V_l$ and $V_u$ are kept
fixed. The prototype $\mathbf{g}_k = (g_{k1}, \ldots, g_{kP})$ of fuzzy cluster $k$ which
minimizes the clustering criterion (Eq. 1) has the bounds of the interval
$g_{kj} = [\alpha_{kj}, \beta_{kj}]$ ($j = 1, \ldots, P$) computed according to:

$$\alpha_{kj} = \frac{\sum_{i=1}^{N} u_{ik}\, a_{ij}}{\sum_{i=1}^{N} u_{ik}} \quad \text{and} \quad \beta_{kj} = \frac{\sum_{i=1}^{N} u_{ik}\, b_{ij}}{\sum_{i=1}^{N} u_{ik}} \qquad (4)$$
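
A minimal sketch of the representation step, assuming Eq. (4) is the membership-weighted mean of the lower and of the upper bounds. The function name and the data layout (`X[i][j]` is a `(lower, upper)` tuple, `U[i][k]` a membership degree) are illustrative assumptions.

```python
def update_prototypes(X, U):
    """Return G, where G[k][j] = (alpha_kj, beta_kj) is the membership-weighted
    mean of the bounds of variable j over all objects (assumed form of Eq. 4)."""
    N, P, C = len(X), len(X[0]), len(U[0])
    G = []
    for k in range(C):
        total_u = sum(U[i][k] for i in range(N))  # assumed strictly positive
        G.append([
            (
                sum(U[i][k] * X[i][j][0] for i in range(N)) / total_u,
                sum(U[i][k] * X[i][j][1] for i in range(N)) / total_u,
            )
            for j in range(P)
        ])
    return G
```

Because the entropy-regularized criterion is linear in the memberships, no fuzzifier exponent appears in the weights of the mean.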


Weighting Step: This step provides the solutions for the computation of the matrices
of relevance weights. During the weighting step, the vector $G$ of prototypes and the
matrix of membership degrees $U$ are kept fixed. The objective function (1) is
optimized with respect to the relevance weights. After setting the partial
derivatives of $\mathcal{L}$ w.r.t. $v_{l,kj}$, $v_{u,kj}$ and $\gamma_k$ to zero and after
some algebra, the relevance weights are computed as follows:

$$v_{l,kj} = \frac{\left\{ \prod_{h=1}^{P} \left[ \sum_{i=1}^{N} u_{ik} (a_{ih} - \alpha_{kh})^2 \right] \left[ \sum_{i=1}^{N} u_{ik} (b_{ih} - \beta_{kh})^2 \right] \right\}^{\frac{1}{2P}}}{\sum_{i=1}^{N} u_{ik} (a_{ij} - \alpha_{kj})^2} \qquad (5)$$

$$v_{u,kj} = \frac{\left\{ \prod_{h=1}^{P} \left[ \sum_{i=1}^{N} u_{ik} (a_{ih} - \alpha_{kh})^2 \right] \left[ \sum_{i=1}^{N} u_{ik} (b_{ih} - \beta_{kh})^2 \right] \right\}^{\frac{1}{2P}}}{\sum_{i=1}^{N} u_{ik} (b_{ij} - \beta_{kj})^2} \qquad (6)$$
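
A minimal sketch of the weighting step for one cluster, assuming Eqs. (5) and (6) share a numerator equal to the $2P$-th root of the product of all per-variable lower and upper dispersions, which is what the product-to-one constraint implies. The function name, the data layout and the strict-positivity assumption on the dispersions are illustrative and not taken from the paper.

```python
import math

def update_weights(X, G, U, k):
    """Return (v_l, v_u) for cluster k under the assumed form of Eqs. (5)-(6)."""
    N, P = len(X), len(X[0])
    # Membership-weighted dispersions of the lower (A) and upper (B) bounds.
    A = [sum(U[i][k] * (X[i][j][0] - G[k][j][0]) ** 2 for i in range(N)) for j in range(P)]
    B = [sum(U[i][k] * (X[i][j][1] - G[k][j][1]) ** 2 for i in range(N)) for j in range(P)]
    # Shared numerator, computed in log space; assumes every A[j], B[j] > 0.
    log_num = sum(math.log(a) + math.log(b) for a, b in zip(A, B)) / (2 * P)
    num = math.exp(log_num)
    return [num / a for a in A], [num / b for b in B]
```

A boundary with a small dispersion inside the cluster receives a large weight, and the product of all $2P$ weights of a cluster equals 1 by construction.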


Allocation Step: This step provides an optimal solution to the computation of the
matrix of membership degrees of the objects in the fuzzy clusters. During the
allocation step, the vector $G$ of prototypes and the matrices $V_l$ and $V_u$ of
relevance weights are kept fixed. The objective function (1) is optimized with
respect to the membership degrees. After setting the partial derivatives of
$\mathcal{L}$ w.r.t. $u_{ik}$ and $\lambda_i$ to zero and after some algebra, we obtain:

$$u_{ik} = \frac{\exp\left\{ -\dfrac{\sum_{j=1}^{P} \left[ v_{l,kj} (a_{ij} - \alpha_{kj})^2 + v_{u,kj} (b_{ij} - \beta_{kj})^2 \right]}{T_u} \right\}}{\sum_{h=1}^{C} \exp\left\{ -\dfrac{\sum_{j=1}^{P} \left[ v_{l,hj} (a_{ij} - \alpha_{hj})^2 + v_{u,hj} (b_{ij} - \beta_{hj})^2 \right]}{T_u} \right\}} \qquad (7)$$
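
A minimal sketch of the allocation step, assuming Eq. (7) is a softmax of the negative adaptive dissimilarities scaled by $T_u$. Subtracting the smallest dissimilarity before exponentiating is a numerical-stability choice added here, not something stated in the paper; it leaves the ratio unchanged.

```python
import math

def update_memberships(X, G, V_l, V_u, T_u):
    """Return U with U[i][k] proportional to exp(-d_k(x_i, g_k) / T_u)."""
    U = []
    for x in X:
        d = [
            sum(
                wl * (a - alpha) ** 2 + wu * (b - beta) ** 2
                for (a, b), (alpha, beta), wl, wu in zip(x, g, v_l, v_u)
            )
            for g, v_l, v_u in zip(G, V_l, V_u)
        ]
        d_min = min(d)  # shift so the largest exponent is 0
        e = [math.exp(-(d_k - d_min) / T_u) for d_k in d]
        s = sum(e)
        U.append([e_k / s for e_k in e])
    return U
```

Small values of $T_u$ push the memberships towards a hard assignment, while large values flatten them towards $1/C$.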


The Algorithm: The AIFCM-ER fuzzy clustering algorithm is summarized in
Algorithm 1.


Algorithm 1. AIFCM-ER Algorithm

Input: The dataset $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$; the number $C$ of clusters
   ($2 \le C < N$) and the parameter $T_u > 0$; the parameter $T$ (maximum number of
   iterations); the threshold $\varepsilon > 0$ and $\varepsilon \ll 1$.
Output: The vector of prototypes $G$; the matrix of membership degrees $U$; the
   relevance weight matrices $V_l$ and $V_u$.
1: Initialization: Set $t = 0$;
   Randomly select $C$ distinct prototypes $\mathbf{g}_k^{(t)} \in D$ ($k = 1, \ldots, C$) to
   obtain the vector of prototypes $G^{(t)} = (\mathbf{g}_1^{(t)}, \ldots, \mathbf{g}_C^{(t)})$;
   Initialize the matrices of relevance weights
   $V_l^{(t)} = (v_{l,kj}^{(t)})_{1 \le k \le C,\; 1 \le j \le P}$ with $v_{l,kj}^{(t)} = 1$ and
   $V_u^{(t)} = (v_{u,kj}^{(t)})_{1 \le k \le C,\; 1 \le j \le P}$ with $v_{u,kj}^{(t)} = 1$, $\forall k, j$;
2: repeat
      Set $t = t + 1$
3:    Representation step: Compute $G^{(t)}$ using Equation 4;
4:    Weighting step: Compute $V_l^{(t)}$ and $V_u^{(t)}$ using Equations 5 and 6;
5:    Allocation step: Compute $U^{(t)}$ using Equation 7;
6: until $|J_{\text{AIFCM-ER}}^{(t)} - J_{\text{AIFCM-ER}}^{(t-1)}| < \varepsilon$ or $t > T$


3 Experimental Results


This section evaluates the performance and illustrates the usefulness of
the AIFCM-ER algorithm by applying it to suitable synthetic and real datasets.


3.1 Experimental Setting 


The performance of the proposed algorithm is compared with two previous fuzzy
clustering models: the fuzzy C-means for symbolic interval data (IFCM) and
the fuzzy C-means for symbolic interval data based on an adaptive squared
Euclidean distance between vectors of intervals (IFCMADC) [4].

To compare the clustering results furnished by the algorithms, four measures
were used: the Fuzzy Rand index (FRI) [7], the Hullermeier index (HUL) [12], the
Adjusted Rand index (ARI) [11] and the F-Measure [1]. From the fuzzy partition
$U = (\mathbf{u}_1, \ldots, \mathbf{u}_N)$ a hard partition $Q = (Q_1, \ldots, Q_C)$ is obtained,
where the cluster $Q_k$ ($k = 1, \ldots, C$) is defined as
$Q_k = \{ i \in \{1, \ldots, N\} : u_{ik} \ge u_{im}, \forall m \in \{1, \ldots, C\} \}$.
The FRI and HUL indices compare the a priori partition of the synthetic datasets
with the fuzzy partition provided by the algorithms, while the ARI and the
F-Measure compare it with the hard partition.
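
As a minimal sketch of how a hard partition can be derived from the membership matrix, the function below assigns each object to the cluster with the largest membership degree. Breaking ties in favour of the lowest cluster index is an assumption made here so that every object lands in exactly one cluster; the function name is illustrative.

```python
def harden(U):
    """Return a list of C lists; clusters[k] holds the indices i whose largest
    membership degree is U[i][k] (ties go to the lowest k, an assumption)."""
    C = len(U[0])
    clusters = [[] for _ in range(C)]
    for i, u_i in enumerate(U):
        k = max(range(C), key=lambda c: u_i[c])  # first maximum wins
        clusters[k].append(i)
    return clusters
```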

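All the interval datasets (synthetic and real) were normalized as follows.
Let $D_j = \{x_{1j}, \ldots, x_{Nj}\}$ be the set of observed intervals
$x_{ij} = [a_{ij}, b_{ij}]$ on variable $j$ ($j = 1, \ldots, P$). The dispersion of the
$j$-th variable is defined as

$$s_j^2 = \sum_{i=1}^{N} d_j(x_{ij}, g_j) = \sum_{i=1}^{N} \left[ (a_{ij} - \alpha_j)^2 + (b_{ij} - \beta_j)^2 \right],$$

where $g_j = [\alpha_j, \beta_j]$ is the "central" interval computed from $D_j$ as
$\alpha_j = \frac{1}{N} \sum_{i=1}^{N} a_{ij}$ and $\beta_j = \frac{1}{N} \sum_{i=1}^{N} b_{ij}$.
Each observed interval $x_{ij}$ is normalized as
$\tilde{x}_{ij} = [\tilde{a}_{ij}, \tilde{b}_{ij}]$, where
$\tilde{a}_{ij} = (a_{ij} - \alpha_j)/\sqrt{s_j^2}$ and
$\tilde{b}_{ij} = (b_{ij} - \beta_j)/\sqrt{s_j^2}$, with $\tilde{a}_{ij} \le \tilde{b}_{ij}$
for all $i, j$. Therefore $\tilde{D}_j = \{\tilde{x}_{1j}, \ldots, \tilde{x}_{Nj}\}$ and, for
this dataset, one can show that $\tilde{g}_j = [\tilde{\alpha}_j, \tilde{\beta}_j] = [0, 0]$
and that
$\tilde{s}_j^2 = \sum_{i=1}^{N} d_j(\tilde{x}_{ij}, \tilde{g}_j) = \sum_{i=1}^{N} \left[ (\tilde{a}_{ij} - 0)^2 + (\tilde{b}_{ij} - 0)^2 \right] = 1$.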
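
A minimal sketch of this normalization for a single variable. It assumes that each bound is centred on the corresponding bound of the central interval and divided by the square root of the dispersion, which is the reading consistent with the stated properties (central interval $[0, 0]$ and unit dispersion after normalization). The function name is illustrative.

```python
import math

def normalize_variable(intervals):
    """intervals: list of (a_ij, b_ij) for one variable j.
    Returns the normalized intervals under the assumed centring and scaling."""
    n = len(intervals)
    alpha = sum(a for a, _ in intervals) / n  # lower bound of the central interval
    beta = sum(b for _, b in intervals) / n   # upper bound of the central interval
    s2 = sum((a - alpha) ** 2 + (b - beta) ** 2 for a, b in intervals)
    s = math.sqrt(s2)  # assumes the variable is not constant, so s > 0
    return [((a - alpha) / s, (b - beta) / s) for a, b in intervals]
```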

The parameter $T_u$ for the proposed algorithm was chosen without supervision as
follows. For each dataset, the value of $T_u$ was varied from $10^{-4}$ to 100 (with
step $10^{-4}$), and the threshold for $T_u$ corresponds to the value of the fuzzifier
at which the minimum centroid distance falls under 0.1 for the first time. The
parameter $m$ for the IFCM and IFCMADC algorithms was set to 1.5 and 2.0. The
parameter $\varepsilon$ was set to $10^{-5}$, the maximum number of iterations $T$ was 50,
and for each dataset, the number of clusters was set equal to the number of a
priori classes.


3.2 Synthetic Interval-Valued Datasets 


This section investigates performance aspects of the AIFCM-ER algorithm using
synthetic datasets. First, two datasets of 150 points in $\mathbb{R}^2$ were constructed
to show the usefulness of the proposed method on interval datasets with linearly
non-separable classes of different shapes and sizes. In each dataset, the 150 points
are drawn from three bivariate normal distributions with independent components.
There are three classes of unequal sizes and shapes: two classes with an ellipsoidal
shape and size 50 and one class with a spherical shape and size 50. The first
dataset shows well-separated classes and the second shows overlapping classes.
The data points of each class in these datasets were generated according to the
parameters shown in Table 1.


Table 1. Mean ($\mu_1$, $\mu_2$) and standard deviation ($\sigma_1$, $\sigma_2$) vectors for
every class in synthetic datasets 1 and 2.


Dataset 1                                  Dataset 2
$\mu$      | Class 1 | Class 2 | Class 3   $\mu$      | Class 1 | Class 2 | Class 3
$\mu_1$    | 28      | 60      | 46        $\mu_1$    | 50      | 60      | 52
$\mu_2$    | 22      | 30      | 38        $\mu_2$    | 28      | 30      | 38
$\sigma_1$ | 100     | 9       | 9         $\sigma_1$ | 100     | 9       | 9
$\sigma_2$   9   144                       $\sigma_2$   9   144


In order to build interval datasets from datasets 1 and 2, each point $(z_1, z_2)$
of these datasets is considered as the 'seed' of a rectangle. Each rectangle is
therefore a vector of two intervals defined by
$([z_1 - \delta_1/2,\; z_1 + \delta_1/2],\; [z_2 - \delta_2/2,\; z_2 + \delta_2/2])$.
The parameters $\delta_1$ and $\delta_2$ are the width and the height of the rectangle. In
our experiments, $\delta_1$ and $\delta_2$ are obtained randomly from $[1, 8]$.
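
A minimal sketch of this construction, assuming the rectangle is centred on the seed and that the width and height are drawn uniformly from $[1, 8]$ (the text only says "randomly"). The function names and the use of Python's `random.Random` are illustrative; the class parameters are passed in rather than hard-coded.

```python
import random

def make_rectangle(z1, z2, rng, low=1.0, high=8.0):
    """Turn a seed point (z1, z2) into a vector of two intervals."""
    d1 = rng.uniform(low, high)  # width, assumed uniform on [low, high]
    d2 = rng.uniform(low, high)  # height, assumed uniform on [low, high]
    return [(z1 - d1 / 2, z1 + d1 / 2), (z2 - d2 / 2, z2 + d2 / 2)]

def make_class(mu, sigma, size, rng):
    """Draw `size` seeds from independent normals and wrap each in a rectangle.
    mu and sigma are pairs of means and standard deviations."""
    return [
        make_rectangle(rng.gauss(mu[0], sigma[0]), rng.gauss(mu[1], sigma[1]), rng)
        for _ in range(size)
    ]
```

Passing an explicit `rng` keeps each Monte Carlo replication reproducible from its seed.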

Another synthetic dataset was created using the lower and upper boundary
configurations shown in Table 2. For the lower boundary, variables $x_1$ and $x_2$ are
relevant for class 1 and class 2, and variables $x_2$ and $x_3$ are relevant for class
3 and class 4. For the upper boundary, all variables are equally relevant for the
class definition. The purpose of this dataset is to see what happens if the variable
weight is heavily determined by just one boundary. In this case, the IFCMADC
algorithm should fail to identify a cluster structure.


Table 2. Mean ($\mu_1$, $\mu_2$, $\mu_3$) and standard deviation ($\sigma_1$, $\sigma_2$, $\sigma_3$)
vectors for every class in synthetic dataset 3.


Lower boundary configuration
$\mu$      | Class 1 | Class 2 | Class 3 | Class 4
$\mu_1$    | 0.5     | 0.5     | 0       | 0
$\mu_2$    | -0.5    | -0.5    | 0.5     | 0.5
$\mu_3$    | 0.0     | 0.0     | -0.5    | 0.5
$\sigma_1$ | 0.04    | 0.04    | 1.0     | 1.0
$\sigma_2$ | 0.04    | 0.04    | 0.04    | 0.04
$\sigma_3$ | 1.0     | 1.0     | 0.04    | 0.04

Upper boundary configuration
$\mu$      | Class 1 | Class 2 | Class 3 | Class 4
$\mu_1$    | 3.0     | 4.0     | 3.5     | 3.5
$\mu_2$    | 3.0     | 3.0     | 4.0     | 4.0
$\mu_3$    | 3.5     | 3.5     | 3.5     | 3.5
$\sigma_1$ | 1.0     | 1.0     | 1.0     | 1.0
$\sigma_2$ | 1.0     | 1.0     | 1.0     | 1.0
$\sigma_3$ | 1.0     | 1.0     | 1.0     | 1.0


In the framework of a Monte Carlo experiment, the previous process was replicated
100 times for seeds taken from all datasets. In each replication, the algorithm
was executed 50 times, with the cluster centers randomly initialized each time.
The best result for each algorithm was selected according to its objective
function. The average and standard deviation of the indexes were calculated over
the 100 Monte Carlo replications.


Results: Table 3 gives the values of the indexes obtained with adaptive and
non-adaptive distances for the synthetic interval-valued datasets.


Table 3. Performance of the algorithms on the synthetic interval-valued data.
Standard deviations are given in parentheses.


Dataset 1
Algorithm         | FRI             | HUL             | ARI             | F-Measure
IFCM (m=1.5)      | 0.7966 (0.0207) | 0.7966 (0.0210) | 0.5892 (0.0461) | 0.7331 (0.0284)
IFCM (m=2)        | 0.7456 (0.0179) | 0.7404 (0.0203) | 0.6082 (0.0522) | 0.7442 (0.0327)
IFCMADC (m=1.5)   | 0.9356 (0.0135) | 0.9382 (0.0133) | 0.9283 (0.0389) | 0.9519 (0.0261)
IFCMADC (m=2)     | 0.8499 (0.0115) | 0.8600 (0.0116) | 0.9274 (0.0390) | 0.9513 (0.0261)
AIFCM-ER          | 0.9416 (0.0526) | 0.9453 (0.0471) | 0.9316 (0.0353) | 0.9541 (0.0236)

Dataset 2
Algorithm         | FRI             | HUL             | ARI             | F-Measure
IFCM (m=1.5)      | 0.6901 (0.0180) | 0.6846 (0.0189) | 0.3668 (0.0504) | 0.5868 (0.0320)
IFCM (m=2)        | 0.6471 (0.0126) | 0.6186 (0.0163) | 0.3842 (0.0485) | 0.5965 (0.0313)
IFCMADC (m=1.5)   | 0.7763 (0.0231) | 0.7758 (0.0238) | 0.5696 (0.0680) | 0.7139 (0.0441)
IFCMADC (m=2)     | 0.7075 (0.0156) | 0.6972 (0.0190) | 0.5754 (0.0652) | 0.7175 (0.0427)
AIFCM-ER          | 0.7766 (0.0508) | 0.7722 (0.0570) | 0.5783 (0.0730) | 0.7194 (0.0476)

Dataset 3
Algorithm         | FRI             | HUL             | ARI             | F-Measure
IFCM (m=1.5)      | 0.6443 (0.0039) | 0.5153 (0.0085) | 0.1125 (0.0263) | 0.3333 (0.0197)
IFCM (m=2)        | 0.6268 (0.0006) | 0.2904 (0.0180) | 0.1236 (0.0265) | 0.3897 (0.0172)
IFCMADC (m=1.5)   | 0.6542 (0.0057) | 0.5496 (0.0118) | 0.1393 (0.0280) | 0.3537 (0.0212)
IFCMADC (m=2)     | 0.6277 (0.0011) | 0.3106 (0.0212) | 0.1351 (0.0281) | 0.3883 (0.0185)
AIFCM-ER          | 0.9827 (0.0294) | 0.9827 (0.0294) | 0.9537 (0.0787) | 0.9652 (0.0591)


As expected, the average indexes are better for the algorithms based on adaptive
distances. The proposed algorithm also presents the best average performance on
nearly every index (the exception being HUL on Dataset 2, where IFCMADC with
m = 1.5 is marginally higher), since in general it can identify clusters with
different structures. Regarding the third dataset, the AIFCM-ER algorithm is also
able to discover cluster structures when they are present in only some of the
boundaries of the interval variables, which is an advantage over previous results
reported in the literature. To see how the learned metrics help to understand the
data, Table 4 shows the matrices of relevance weights of the variables on the fuzzy
clusters obtained by the IFCMADC algorithm for m = 1.5 and m = 2, and $V_l$ and $V_u$
for the proposed method, all on Dataset 3.


Table 4. Relevance weights for the IFCMADC and the AIFCM-ER algorithms for the
Dataset 3.


IFCMADC (m = 1.5)
          | Var. 1 | Var. 2 | Var. 3
Cluster 1 | 0.7832 | 1.5333 | 0.8328
Cluster 2 | 0.4985 | 2.0149 | 0.9955
Cluster 3 | 0.8471 | 1.7137 | 0.6889
Cluster 4 | 0.9693 | 1.6952 | 0.6085

IFCMADC (m = 2)
          | Var. 1 | Var. 2 | Var. 3
Cluster 1 | 0.9960 | 1.0991 | 0.9135
Cluster 2 | 0.9940 | 1.0994 | 0.9150
Cluster 3 | 0.7260 | 1.2295 | 1.1203
Cluster 4 | 1.0910 | 1.0850 | 0.8448

AIFCM-ER
          | Var. 1            | Var. 2            | Var. 3
          | $v_{l1}$ | $v_{u1}$ | $v_{l2}$ | $v_{u2}$ | $v_{l3}$ | $v_{u3}$
Cluster 1 | 0.3595   | 0.3453   | 8.8843   | 0.2682   | 10.4799  | 0.3226
Cluster 2 | 7.2457   | 0.3581   | 10.0801  | 0.3162   | 0.3050   | 0.3964
Cluster 3 | 7.2740   | 0.3358   | 10.2935  | 0.3806   | 0.3283   | 0.3182
Cluster 4 | 0.3374   | 0.3743   | 9.5973   | 0.3193   | 8.6486   | 0.2988


We can observe in Table 4 that, for the proposed algorithm, the weights for the
upper boundary are similar for each cluster. For the lower boundary, variables
$x_2$ and $x_3$ have higher values for clusters 1 and 4, and variables $x_1$ and $x_2$
have higher values for clusters 2 and 3. These results confirm that selecting the
more influential variables and/or boundaries in the cluster partition improves the
clustering performance.


3.3 Symbolic Interval Datasets 


To validate the proposed method, we conducted several experiments on the following
interval-valued datasets: Car models [6] (N = 33, P = 8, C = 4), City temperature
[9] (N = 37, P = 12, C = 4), Freshwater fish species [3] (N = 12, P = 13, C = 4),
Horses (N = 12, P = 7, C = 4), Ichino [13] (N = 8, P = 4, C = 4) and Wine
(N = 23, P = 21, C = 4), where N is the number of objects, P is the number of
interval-valued variables and C is the number of a priori classes. For each
dataset the algorithms were run 50 times and the best results were selected
according to the minimum value of their objective function. Table 5 presents the
results provided by the algorithms on the real interval-valued datasets.


Table 5. Performance of the algorithms on the interval-valued data.


Dataset                 | Algorithm         | FRI    | HUL    | ARI     | F-Measure
Car models              | IFCM (m=1.5)      | 0.8240 | 0.8178 | 0.5623  | 0.6667
Car models              | IFCM (m=2)        | 0.7448 | 0.6871 | 0.5623  | 0.6667
Car models              | IFCMADC (m=1.5)   | 0.8148 | 0.8109 | 0.4998  | 0.6190
Car models              | IFCMADC (m=2)     | 0.7644 | 0.7263 | 0.5257  | 0.6371
Car models              | AIFCM-ER          | 0.7936 | 0.7638 | 0.6312  | 0.7160
City temperature        | IFCM (m=1.5)      | 0.7578 | 0.7592 | 0.5458  | 0.6951
City temperature        | IFCM (m=2)        | 0.6768 | 0.6801 | 0.5134  | 0.6710
City temperature        | IFCMADC (m=1.5)   | 0.7639 | 0.7674 | 0.5458  | 0.6951
City temperature        | IFCMADC (m=2)     | 0.6875 | 0.6978 | 0.5160  | 0.6710
City temperature        | AIFCM-ER          | 0.7460 | 0.8103 | 0.5458  | 0.6951
Freshwater fish species | IFCM (m=1.5)      | 0.6621 | 0.6257 | 0.2376  | 0.4324
Freshwater fish species | IFCM (m=2)        | 0.6798 | 0.5539 | 0.0671  | 0.2941
Freshwater fish species | IFCMADC (m=1.5)   | 0.7569 | 0.7569 | 0.2757  | 0.4286
Freshwater fish species | IFCMADC (m=2)     | 0.7332 | 0.7149 | 0.2087  | 0.3704
Freshwater fish species | AIFCM-ER          | 0.9242 | 0.9242 | 0.7534  | 0.8000
Horses                  | IFCM (m=1.5)      | 0.6972 | 0.6868 | 0.0559  | 0.2667
Horses                  | IFCM (m=2)        | 0.6900 | 0.6279 | 0.0559  | 0.2667
Horses                  | IFCMADC (m=1.5)   | 0.7848 | 0.7824 | 0.3295  | 0.4615
Horses                  | IFCMADC (m=2)     | 0.7041 | 0.6604 | 0.1417  | 0.3333
Horses                  | AIFCM-ER          | 0.8026 | 0.7984 | 0.4272  | 0.5517
Ichino                  | IFCM (m=1.5)      | 0.8212 | 0.8213 | 0.4444  | 0.5455
Ichino                  | IFCM (m=2)        | 0.8131 | 0.8090 | 0.4444  | 0.5455
Ichino                  | IFCMADC (m=1.5)   | 0.8250 | 0.8250 | 0.3396  | 0.4444
Ichino                  | IFCMADC (m=2)     | 0.8301 | 0.8285 | 0.3396  | 0.4444
Ichino                  | AIFCM-ER          | 0.9988 | 0.9988 | 1.0000  | 1.0000
Wine                    | IFCM (m=1.5)      | 0.5745 | 0.4320 | 0.0059  | 0.3026
Wine                    | IFCM (m=2)        | 0.5879 | 0.3241 | -0.0183 | 0.3828
Wine                    | IFCMADC (m=1.5)   | 0.5854 | 0.5126 | 0.1092  | 0.4048
Wine                    | IFCMADC (m=2)     | 0.5879 | 0.3243 | 0.0341  | 0.4098
Wine                    | AIFCM-ER          | 0.6050 | 0.5475 | 0.1306  | 0.3922


The results in Table 5 show that the proposed method obtains the best result for
almost all datasets according to FRI and HUL. Concerning the comparison between
the hard partitions and the a priori partition, the proposed method achieves the
best (or tied-best) ARI on all datasets and the best (or tied-best) F-Measure on
all datasets except Wine. In general, the AIFCM-ER algorithm shows that selecting
the more influential variables and/or boundaries in each fuzzy partition helps to
obtain better performance compared with previous methods.


4 Conclusion 


This paper presented a fuzzy clustering algorithm for interval-valued data based
on adaptive Euclidean distance and entropy regularization. In particular, the
algorithm can discover cluster structures even when they are present in only some
of the boundaries of the interval-valued variables. The algorithm starts from an
initial fuzzy partition and then alternates over three steps (representation,
weighting, and allocation) until it converges, i.e., until the adequacy criterion
reaches a stationary value. The paper provides a new objective function, from
which the update rules are derived. The applications to synthetic and real data
confirm the hypothesis that algorithms based on adaptive distances are useful for
discovering non-spherical clusters and for selecting the more influential
variables and/or boundaries in the cluster partition.
Abstract. Deep neural networks are an accurate tool for solving, among other
things, vision tasks. The computational cost of these networks is often high, 
preventing their adoption in many real time applications. Thus, there is a con- 
stant need for computational saving in this research domain. In this paper we 
suggest trading accuracy with computation using a gated version of Convolu- 
tional Neural Networks (CNN). The gated network selectively activates only a 
portion of its feature-maps, depending on the given example to be classified. The 
network’s ‘gates’ imply which feature-maps are necessary for the task, and 
which are not. Specifically, full feature maps are considered for omission, to 
enable computational savings in a manner compliant with GPU hardware con- 
straints. The network is trained using a combination of back-propagation for 
standard weights, minimizing an error-related loss, and reinforcement learning 
for the gates, minimizing a loss related to the number of feature maps used. We 
trained and evaluated a gated version of dense-net on the CIFAR-10 dataset [1]. 
Our results show that, with only a slight impact on network accuracy, a potential
acceleration of up to ×3 may be obtained.


Keywords: Neural networks · Pruning · Acceleration · Conditional computation · Feature-map


1 Introduction 


The variability and richness of natural visual data make it almost impossible to build
accurate recognition systems manually. Thus, machine learning algorithms dominate
these problems today. Deep Neural Networks (DNNs) are hierarchical machine learning
algorithms which currently provide the best results in the fields of computer vision,
speech processing and Natural Language Processing (NLP). Focusing on vision, CNNs
provide good solutions for difficult tasks such as image classification, object
detection/localization, captioning, segmentation and image generation. Research on
CNNs is constantly evolving and their industrial integration is increasing
significantly.

CNNs are a cascade of convolution, sub-sampling and activation layers which are
applied to the input. The computational cost of these networks is high, often
preventing their usage in real-time applications. Improvements in computer hardware,
and specifically in GPUs, allow the usage of deeper networks that provide more
accuracy, but they raise the need for computational saving even further.

In this paper we present a network that uses only some of its feature maps, chosen
in an input-dependent manner, to classify images. Using the assumption that each
feature map of the network locates and extracts a certain feature from its input
[2, 3], we assume that, given a specific example, only the computation of some
feature maps actually improves the network accuracy. We hence build decision
mechanisms, termed “gates”, which decide for each feature map in every layer whether
the map should be computed or not. Since these gates make sharp decisions, we
optimize their parameters using a reinforcement learning framework. We experiment
with a DenseNet base architecture, which is one of the more accurate contemporary
alternatives, on the CIFAR-10 dataset. Our results indicate that speedups of up to
3x are obtainable with an accuracy drop of less than 2%.


2 Related Work 


Many studies considered saving deep neural networks’ computational cost. Diverse 
approaches are suggested, including low rank decomposition for convolutional and 
global layers using separable filters [4, 5], tensor decomposition [6, 7], weights and 
activations quantization [8, 9] and implementation using FFT [10]. In this research we 
decrease CNN’s computational cost using conditional computation, and we hence 
focus on this literature. 

Combining conditional computation with networks is often non-trivial since it
makes the training process difficult. Despite this, the usage of conditional
computation during training and test time has been studied using several approaches.
A CNN that deals with a dynamic time budget was suggested in [11], allowing output
estimation without completing the entire forward propagation process, using
additional loss layers in earlier stages of the network. Although an early
classification is obtained, the accuracy decreases significantly when the time
budget is low.

The model described in [12] suggests selecting a combination of sub-networks
located between stacked LSTMs, in an input-dependent fashion. This model aims to
increase its number of parameters using these sub-networks, thus allowing it to
handle tasks that require many parameters, such as language modeling. This model
does not save computation on tasks such as image classification, considered in the
present paper.

Another suggested model is a recurrent neural network which selectively processes
only some regions of the input, using reinforcement learning methods [13]. In [14], a
reinforcement learning technique is also used to train a fully connected neural
network to drop neurons in an input-dependent manner. These models use sparse
tensors, which are less compliant with hardware constraints, making the
computational saving less efficient.

In our model we trained a CNN with bypass connections in an input-dependent
manner, such that only the necessary feature-maps are computed. Our results show that
this approach allows computational saving with a significant test time
speedup. The method is orthogonal to many of the methods mentioned above [4-7, 10],
and hence can be combined with them to obtain further acceleration.


3 Baseline Network - ‘DenseNet’


In traditional networks the layers are sequentially connected one after the other. Dif- 
ferent studies [15, 16] have shown that networks containing shorter connections (by- 
passes) enable deeper architectures which are more accurate. One such recent 
architecture is our network’s baseline — DenseNet [17]. While traditional CNNs’ layers 
are serially connected, in DenseNet the layers are sorted in large blocks, each con- 
taining multiple convolutional layers. Within each block, convolutional layers (fol- 
lowed by ReLU and Batch Normalization (BN)) forward their output maps to all 
subsequent convolution layers within the same block. Transition layers (convolution
and average-pooling) separate the blocks and decrease the feature map size.
Deeper layers in the block hence receive as input the maps from all their
predecessors in the block, so their input size (number of input maps) increases. To
limit the number of input maps, the output size of all layers is limited to k maps.

Using the DenseNet architecture, each omitted feature map implies a significant
computational saving, as the map is used as input for all following layers in the
block, and not only for the next layer. Our architecture takes advantage of this
insight to reduce the amount of computation at test time.


4 Model 


4.1 Motivation 


As part of the effort to reduce the computational cost of deep networks, our model
prunes feature maps in an input-dependent manner. The computation of a feature-map
can have a significant or a negligible influence on the output accuracy for a given
example, depending on its content. For example, classifying dogs and cats is
considerably different from distinguishing trains from tracks, thus these two tasks
depend on different feature maps. Motivated by this observation, we created an
architecture that selects which maps to compute and which to omit while classifying
a given input.


4.2 Architecture 


The model is based on the architecture described in [17]. To identify the essential
feature-maps of the network for a given input to be classified, input-dependent
‘gates’ are added. Each gate is associated with a feature-map and indicates whether
computing that feature-map is necessary. The gates are binary valued: ‘1’ means the
feature-map is needed, ‘0’ means it is unnecessary. As shown in Fig. 1, to produce
these gates we connect to each layer a ‘gates branch’ whose outputs are k gate
values (one for each map). To preserve the network’s computational efficiency, the
computational cost of the gate branches is kept low. Average pooling with a fixed
4 × 4 output size is applied to the input layers (all the $l$ preceding layers in the
block), reducing the input size of the following fully-connected (FC) layer. The FC
layer outputs k neurons, each associated with one output map (of layer $l + 1$).
Following this, BN and sigmoid activation layers are applied, producing
probability-like values for map computation. After the sigmoid layer, a stochastic
decision regarding map computation is made by a Bernoulli trial with the probability
of ‘1’ provided by the gate. At test time, only the maps whose gate output is ‘1’
are computed.

At training, to apply the gate decision to the information flowing on the main
branch, each output map of the convolutional layer is multiplied by its associated
gate value. Therefore, a gate decision of “not computing a map” multiplies the
corresponding output map by 0, removing the unnecessary map, while a gate valued 1
keeps the information of the map unchanged.
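
The training-time gating described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not the authors' implementation: feature maps are plain nested lists, the gate probabilities are given as inputs rather than produced by a gates branch, and the function names are hypothetical.

```python
import random

def sample_gates(probs, rng):
    """One Bernoulli trial per feature map: gate = 1 with probability p."""
    return [1 if rng.random() < p else 0 for p in probs]

def apply_gates(feature_maps, gates):
    """Multiply each 2D feature map by its gate value (0 zeroes the map,
    1 leaves it unchanged)."""
    return [
        [[g * v for v in row] for row in fmap]
        for fmap, g in zip(feature_maps, gates)
    ]
```

Multiplying by the gate only emulates pruning during training; an actual saving requires skipping the computation of the gated-off maps altogether.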


Fig. 1. The unit structure: The unit consists of 3 branches: the main branch, the bypass 
(identity) branch and the gate branch. 


4.3 Optimization 


The loss used for training is the standard softmax loss (the gates’ actions and proba- 
bilities are implicit). During training the weights of the network’s main branch (i.e. the 
DenseNet architecture) are adjusted using standard Stochastic Gradient Descent (SGD). 

Since stochastic decisions are made in the gate branch, the introduced discontinuity
and non-differentiability prevent optimization of the entire network using SGD. For
optimization of the gate branches we use a reinforcement learning derivation as in
[18] and minimize the expected loss while taking expectations also with respect to
the stochastic decisions made. For a single batch with B examples, the loss we
minimize to encourage map pruning is

$$L_{Gate} = \frac{\lambda}{K \cdot L \cdot B} \sum_{l=1}^{L} \sum_{k=1}^{K} \sum_{i=1}^{B} \left( p^{l}_{k,i} + \tau \right)^{Q} \qquad (1)$$

where $p^{l}_{k,i}$ is the map computation probability of map $k$ of layer $l$ in example
$i$, $K$ is the number of maps in a layer and $L$ is the number of gated layers of the
network. $\lambda$, $\tau$ and $Q$ are scalar parameters. Raising the probabilities
$p^{l}_{k,i}$ to the power $Q > 1$ encourages diversity of the $p^{l}_{k,i}$ values. The
parameter $\tau$ is used to prevent the derivatives from reaching very high values
approaching $+\infty$, which may otherwise happen for $Q < 1$. This loss is added to the
standard softmax loss to give the total loss minimized. $\lambda$ is used to weight the
pruning-related loss function and to balance accuracy against computational saving.
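
A minimal sketch of the pruning-related loss, assuming it is the average over layers, maps and batch examples of $(p + \tau)^Q$, scaled by $\lambda$. This form is a reading of the description above rather than a verified transcription, and the function name and nested-list layout are illustrative.

```python
def gate_loss(p, lam, tau, Q):
    """p[l][k][i] is the computation probability of map k in gated layer l
    for batch example i. Returns lam / (K * L * B) * sum((p + tau) ** Q)."""
    L = len(p)
    K = len(p[0])
    B = len(p[0][0])
    total = sum((p_lki + tau) ** Q for layer in p for row in layer for p_lki in row)
    return lam * total / (K * L * B)
```

Lower probabilities give a lower loss, so increasing `lam` pushes the gates towards pruning more maps at the expense of the classification loss.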


5 Experiments and Results 


5.1 Dataset - CIFAR-10 


We evaluate the network on the CIFAR-10 dataset [1], which consists of 60,000
images of 10 classes: airplane, automobile, bird, cat, deer, dog, frog, horse,
ship and truck. All the images are RGB images of size 32 × 32. The images are
divided into a training set and a test set of 50,000 and 10,000 images,
respectively. Following [17], we used 5,000 images from the training set as a
validation set.


5.2 Training 


We trained the network using a batch size of 64 examples for 300 epochs on a single
GPU. We set the learning rate to 0.1 over the first 150 epochs, then reduce it to
0.01 for the next 75 epochs, and reduce it again to 0.001 for the last 75 epochs. We
set the number of output maps of each unit to k = 12 to prevent the network from
growing too wide. The layers are arranged in 3 dense blocks with feature maps of
size 32 × 32, 16 × 16 and 8 × 8; each block contains 12 BN-ReLU-Conv units.
Following [17], between the blocks we place a sequence of
BN-ReLU-Conv-Average-pooling with a 2 × 2 pool size. The weight decay is set to
0.0001, the momentum is set to 0.9 and Q is set to 6. We initialize all the main
branch weights to the final weights of a trained DenseNet of the same size. We
removed the drop-out layers from the network, since the map pruning produces
significant training noise, and the additional drop-out noise disturbs the
network optimization and convergence.
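
The step schedule described above can be written as a small function. This is a sketch assuming 0-based epoch indices over the 300 training epochs; the function name is illustrative.

```python
def learning_rate(epoch):
    """Learning rate for a 0-based epoch index in [0, 300)."""
    if epoch < 150:   # first 150 epochs
        return 0.1
    if epoch < 225:   # next 75 epochs
        return 0.01
    return 0.001      # last 75 epochs
```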


Gradual Learning 

In some experiments we trained the network gradually, with each dense block trained 
separately at a time for 300 epochs, using the same parameters stated above. First, for 
300 epochs, only the gate branches’ parameters of the bottom dense block were 
adjusted. Then, for another 300 epochs, the gate branches’ parameters of the second 
block as well as the first block were adjusted. Finally, the third block’s gate branches 
were added to the process, and the entire network was trained simultaneously for 300 
epochs. 


5.3 Results 


We evaluated the networks on the 10,000 remaining test set images. At test time, a 
pruning threshold was set to 0.5 and if the gate probability is above this threshold, the 
map is computed. To evaluate the computational saving potential of our gated version, 
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note that the computation complexity of a convolutional layer with input size 
W x H x din, output size W x H x K and filter size F x F is 


O(W x H x di, x K x F°) (2) 


In our framework, the average probability of a map to be computed is 
$P = E_{l,k}[p_k^l]$. If both the $d_{in}$ input maps and the K output maps are kept 
(not pruned) with probability P, the complexity of computing the convolutional layer is 


\[ O(W \times H \times P d_{in} \times P K \times F^{2}) \tag{3} \]


The speedup obtained by dividing (2) by (3) is $1/P^2$, i.e. it grows quadratically as 
P decreases. Assuming that the pruning probability is approximately invariant across 
layers, this provides a good estimate of the potential acceleration. We hence compute 
the expected acceleration as $1/P^2$, where P is estimated by averaging $p_k^l$ over 
all the maps and all test examples. 

Our main results are shown in Table 1. For λ = 0, i.e. when the map pruning loss is 
not active, the network prefers to keep almost all its gates at ‘1’, thus using almost all 
the maps. When λ is raised to 5, our input-dependent version can provide significant 
potential accelerations of up to ×3, with small accuracy drops of up to 1.7%. 

We compare the results of our input-dependent version with DenseNet networks 
containing fewer maps. In each row, we compare to a DenseNet with the number of maps 
reduced to give a computational cost comparable to that of the gated network. This is 
done by choosing $k_{DenseNet} = k \cdot P$, where P is the probability that a gate is 
not pruned. It can be seen that the input-dependent versions provide an advantage over 
the simpler alternatives. 
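
The two quantities used in this comparison can be checked with a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, the values of P are the averages reported in Table 1, and rounding k · P to the nearest integer is an assumption about how the baseline growth rate was chosen.

```python
def potential_speedup(p: float) -> float:
    """Expected speedup 1 / P^2 when a fraction P of input and output maps is kept."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return 1.0 / (p * p)


def comparable_densenet_growth(k: int, p: float) -> int:
    """Growth rate k * P of a plain DenseNet with similar cost (rounding is assumed)."""
    return round(k * p)


# Average gate probabilities P reported in Table 1, with k = 12.
for p in (0.9995, 0.8382, 0.5773):
    print(f"P={p:.4f}  speedup=x{potential_speedup(p):.2f}  "
          f"k_DenseNet={comparable_densenet_growth(12, p)}")
```

Under these assumptions the sketch reproduces the speedup and k_DenseNet columns of Table 1 up to rounding.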


Table 1. Accuracy and potential speedup of gated networks using different training methods 
and parameters. Together with the initial convolutional layer and the transition layers, the 
network depth is L = 40. 


Training   | Bias | Error | Average | Potential | k_DenseNet | Baseline 
technique  |      | rate  | P       | speedup   |            | error rate 
Standard   | 4    | 7.2%  | 99.95%  | ×1        | 12         | 7.2% 
Gradual    | 1    | 8%    | 83.82%  | ×1.42     | 10         | 8.18% 
Standard   | 1    | 8.9%  | 57.73%  | ×3        | 7          | 9.34% 


Another advantage of a gated network version is that one can further control the 
speed-accuracy trade-off by tuning the pruning threshold at test time. Hence the same 
network can provide a certain range of speed-accuracy working points, chosen at test 
time according to the application needs. This trade-off, as obtained by the network of 
row three of Table 1, is shown in Fig. 2. 

Fig. 2. The potential speed-up and the corresponding error rate using different test threshold 
values. The results were obtained with a network trained with λ = 5 and bias = 1. 


6 Conclusions and Further Work 


We have presented an architecture with input-dependent gates, enabling partial 
computation of feature maps in an input-dependent manner. We showed that such a gated 
version provides a convenient accuracy-speed trade-off, which is slightly preferable to 
the trade-off obtained with plain DenseNet versions. Beyond that, the gated version 
allows an additional accuracy-speed trade-off at run time, hence enabling further 
flexibility when computational constraints are present. 

We are currently working on optimizing the model during training using other 
techniques and on extending the testing to more datasets. 

Abstract. Monitoring the comfort level of a city, which depends on 
several thermal metrics, requires real-time estimation in many indoor 
and outdoor applications. Out of the many thermal comfort indices 
proposed so far, predicted mean vote (PMV) is one of the most widely 
used measures for both indoor and outdoor ambiances. Due to the 
complexity of calculating PMV in real time, many techniques have been 
proposed to estimate it without using all the required parameters. So 
far, fuzzy networks have shown the best results for PMV estimation 
because of their rule generation capability. A convolutional neural 
network (CNN) is a deep learning based technique used to classify, or to 
estimate a particular parameter, by shrinking the input to significant 
data collections. In this work, we fuzzified the system before applying 
CNN for regression to estimate the PMV values. Simulation results show 
that the proposed model outperforms the existing ANFIS model for PMV 
estimation with a lower root mean square error value. 


Keywords: Convolutional neural network · Fuzzy neural network · 
Predicted mean vote · Thermal comfort index 


1 Introduction 


In various fields of everyday life, such as traveling, going to work, and even in 
indoor systems, prediction of the comfort level of the ambiance is an important 
concern. A predicted comfort level of a certain area or indoor system can have 
several applications, such as predicting travel suitability, predicting the thermal 
stress of residents and helping workers to decide their working hours. Even indoors, 
one could tune ventilation or air-conditioning according to the predictions. Thermal 
comfort is referred to as the condition of mind that expresses satisfaction with the 
thermal environment. It is associated not only with air-temperature, but also with 
relative humidity, air velocity, mean radiant temperature, metabolic rate and 
clothing factor [2]. As a thermal comfort index, predicted mean vote (PMV) is widely 
used; it uses all six parameters mentioned previously along with some synthesized 
variables. It is often very difficult to estimate PMV in real time because of this 
complexity, whereas only some of the parameters mentioned are easy to retrieve from 
meteorological data. Methods have therefore been developed to estimate the PMV 
value from a few of the parameters. Fuzzy systems ([15], [16]) have been shown 
to work well in this scenario. 

A convolutional neural network (CNN) can shrink large chunks of data into 
smaller data containing the most important features, which can be used for both 
classification and regression. This work aims at solving the estimation (regression) 
problem with a novel architecture comprising a fuzzy system and a deep neural 
network, along with an analysis of the system, leveraging the rule synthesizing 
ability of fuzzy systems and the estimating ability of CNN. Moreover, the choice 
and inter-dependency of parameters are also demonstrated. 

The remainder of the paper is organized as follows. Basic background is presented 
in Sect. 2, and Sect. 3 provides a literature survey. Motivation and problem 
statement are presented in Sect. 4. The proposed approach using fuzzy-CNN 
architecture is discussed in Sect. 5, whereas the input system description and 
functionality of the architecture are explained in Sect. 6. Simulation results and 
analysis are presented in Sect. 7. Finally, the paper is concluded in Sect. 8. 


2 Basic Background 


2.1 Predicted Mean Vote (PMV) 


As defined by Fanger [9], predicted mean vote (PMV) is a scale of the human 
sensation of thermal comfort, which is backed by ASHRAE [2]. PMV is defined as a 
function of six parameters, namely air-temperature ($T_a$ in °C or degree Celsius), 
relative humidity (RH in %), mean radiant temperature ($T_r$ in °C or degree 
Celsius), air-velocity ($V_{air}$ in m/sec), human metabolic rate (Met in W/m²) 
and clothing factor (Clo in K m² W⁻¹). This quantification of the thermal comfort 
of a group of persons is defined within a scale of −3 to +3 as shown in Table 1. 


Table 1. Correspondence between PMV indices and PMV labels. 


PMV index | −3   | −2   | −1        | 0       | 1         | 2    | 3 
Label     | Cold | Cool | Less cool | Neutral | Less warm | Warm | Hot 


Equation 1 below presents the PMV as a function of the different parameters, 
where W is the external work done (in W/m²), $P_a$ is the water vapour pressure 
(in Pa), $T_{cl}$ is the surface temperature of clothing (in °C), $h_c$ is the 
convective heat transfer coefficient (in W/(m² K)) and $f_{cl}$ is the ratio of 
clothed body surface area to naked body surface area. 

\[
\begin{aligned}
\mathrm{PMV} = {} & (0.303\, e^{-0.036\,Met} + 0.028)\,\Big[ (Met - W) 
  - 3.05 \times 10^{-3} \big[ 5733 - 6.99 (Met - W) - P_a \big] \\
& - 0.42 \big[ (Met - W) - 58.15 \big] 
  - 1.7 \times 10^{-5}\, Met\, (5867 - P_a) 
  - 0.0014\, Met\, (34 - T_a) \\
& - 3.96 \times 10^{-8} f_{cl} \big[ (T_{cl} + 273)^4 - (T_r + 273)^4 \big] 
  - f_{cl} h_c (T_{cl} - T_a) \Big]
\end{aligned}
\tag{1}
\]
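
To make the structure of Eq. 1 concrete, the following minimal sketch evaluates it term by term. The function name and the sample input values are illustrative assumptions, not taken from the paper; the clothing surface temperature and the convective coefficient are treated as given inputs here, whereas in practice they are usually obtained from a separate heat-balance calculation.

```python
import math


def pmv(met, w, pa, ta, tr, tcl, fcl, hc):
    """Evaluate Eq. 1 directly.

    met, w : metabolic rate and external work (W/m^2)
    pa     : water vapour pressure (Pa)
    ta, tr : air and mean radiant temperature (deg C)
    tcl    : clothing surface temperature (deg C), assumed known here
    fcl    : clothed-to-naked body surface area ratio
    hc     : convective heat transfer coefficient (W/(m^2 K)), assumed known here
    """
    load = (
        (met - w)
        - 3.05e-3 * (5733.0 - 6.99 * (met - w) - pa)
        - 0.42 * ((met - w) - 58.15)
        - 1.7e-5 * met * (5867.0 - pa)
        - 0.0014 * met * (34.0 - ta)
        - 3.96e-8 * fcl * ((tcl + 273.0) ** 4 - (tr + 273.0) ** 4)
        - fcl * hc * (tcl - ta)
    )
    return (0.303 * math.exp(-0.036 * met) + 0.028) * load


# Hypothetical inputs, chosen only to exercise the formula.
example = pmv(met=58.15, w=0.0, pa=1500.0, ta=25.0, tr=25.0,
              tcl=28.0, fcl=1.1, hc=3.0)
print(f"PMV = {example:.2f}")
```

With all other inputs fixed, the value increases with air temperature, which matches the intended direction of the scale in Table 1.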


2.2 ANFIS Model 


The adaptive-network-based fuzzy inference system (ANFIS) model was developed 
by Jang et al. [13] using the Takagi-Sugeno fuzzy model [20] to leverage fuzzy 
rule strength and estimate outputs. For a rule i, the rule strength ($w_i$) is defined 
as $w_i = \mu_1^i(x_1) \times \mu_2^i(x_2) \times \dots \times \mu_p^i(x_p)$, where 
$\mu_j^i$ is the membership function for input $x_j$ and the rule i. In the ANFIS 
model, the rule strengths are then normalized ($\bar{w}_i$) and put into a linear 
combination with the input values in order to get the output y as given by Eq. 2, 
where $a_i$ is called the consequent parameter. 

\[ y = \sum_i \bar{w}_i \,(a_0 + a_1 x_1 + a_2 x_2 + \dots + a_n x_n) \tag{2} \]
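
The ANFIS forward computation described above can be sketched in a few lines of NumPy. This is a hedged illustration: the function name is hypothetical, and the consequent parameters are stored per rule (one row per rule), as in the usual Takagi-Sugeno formulation, which is an assumption because Eq. 2 writes them without a rule index.

```python
import numpy as np


def anfis_output(memberships, x, a):
    """Forward pass of a first-order Takagi-Sugeno system.

    memberships : (n_rules, p) membership values mu_j^i(x_j) for each rule i
    x           : (p,) input values
    a           : (n_rules, p + 1) consequent parameters, column 0 is the bias
                  (per-rule parameters are an assumption of this sketch)
    """
    w = memberships.prod(axis=1)          # rule strengths w_i
    w_bar = w / w.sum()                   # normalized strengths
    rule_out = a[:, 0] + a[:, 1:] @ x     # a_0 + a_1 x_1 + ... per rule
    return float(w_bar @ rule_out)


# Two rules, two inputs; all numbers are made up for illustration.
mu = np.array([[0.5, 0.5], [1.0, 0.5]])
x = np.array([1.0, 2.0])
a = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]])
print(anfis_output(mu, x, a))
```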


2.3 Convolutional Neural Network (CNN) 


A convolutional neural network (CNN) is similar to a generic feed-forward 
artificial neural network (ANN), except that it is specifically used to shrink or 
“convolve” the input data into a lower dimensional data-form for further use. 
The hidden layers of a CNN typically consist of three types of layers: convolutional 
layer, pooling layer and fully-connected (FC) layer. In the convolutional layer, a 
window of randomly initialized values is convolved with the part of the input data 
having identical dimension, while the window is slid by some predefined value (the 
stride). A pooling layer transforms each region of its input into a single value, 
which is generally done by taking the maximum (max-pool), minimum (min-pool) or 
average (average-pool). In a fully-connected (FC) layer, every neuron in the previous 
layer is connected to every neuron in the next layer. This layer is generally applied 
after convolving and/or pooling in order to obtain the classification or regression 
value. The relative positioning and the number of layers are specific to a problem 
scenario. 


3 Related Work 


Various ways of estimating thermal comfort levels proposed by researchers 
include a simple version of the comfort index [6], the weight-based or weighted 
comfort index [19], predicted mean vote (PMV) [9], physiological equivalent 
temperature (PET) [12] and standardized PMV (SPMV) [10]. Among all these, PMV is 
the most widely used measure of thermal comfort. Techniques like neural networks, 
support vector machines [14], fuzzy sets, genetic algorithms, etc. have been used to 
enhance the accuracy of estimating the PMV value both in indoor and outdoor 
scenarios. Ciglar et al. [5] showed a model-predictive control framework. Another 
work presents how PMV zones in an outdoor environment of a district in Italy are 
analyzed [11]. Similarly, PMV is also used for analysis of an outdoor environment 
during urban planning [4]. 

Neural networks can learn the error in parameters in a converging manner, 
while fuzzy sets are able to distribute the parameters over intervals in order to 
express the result in a realistic manner. Combinations of both have been leveraged 
in various works. Li et al. [15] proposed a type-2 fuzzy set based neural 
network to estimate the PMV value and also used back-propagation to adjust 
the membership function parameters. Yifan et al. [16] developed a simple fuzzy 
neural network model with a 6-parameter variation and a 4-parameter variation 
(excluding metabolic rate and clothing factor) using multivariate regression for 
estimating PMV, and obtained excellent results in terms of prediction accuracy. 
Popko et al. [18] used a fuzzy logic module along with CNN for handwritten digit 
classification. Moreno et al. [17] combined CNN and a final fuzzy layer to achieve 
classification for object recognition. Zhou et al. [21] attempted regression using 
CNN to estimate the pain of certain facial expressions in video data. 


4 Motivation and Problem Statement 


4.1 Motivation 


CNN is primarily used for classification using deep layer techniques, converting 
large data into smaller and significant data-chunks. On the other hand, in 
order to extend the use of the PMV value as a comfort index in both indoor and 
outdoor environments, finding a different technique for better estimation of PMV is 
a challenge. Li et al. [15] and Yifan et al. [16] were able to achieve significant 
accuracy, with root-mean-square-error (RMSE) values of 0.2 and 0.045, respectively, 
using fuzzy sets and neural networks. The known fact that deep neural 
network architectures like CNN can also be used to perform regression-like tasks 
motivates us to leverage the advantages of both CNN as a regression model and 
fuzzy sets for better estimation of the PMV value. 


4.2 Problem Statement 


The problem is to find an efficient and effective method to estimate the 
PMV value from a few known parameter values. In this paper, an attempt is 
made to use CNN for regression along with fuzzy sets for PMV estimation. This 
work also involves selecting the important parameters. 


5 Proposed Approach: Fuzzy-CNN Architecture 


We adopt the ANFIS model (without any prior knowledge of rules), while the 
estimation of the consequent parameters ($a_i$ in Eq. 2) is left to the CNN layers. 
Regression using neural networks is a widely practiced approach. In the 
5-parameter case, five parameters are considered to estimate the PMV 
value, namely air-temperature (Ta), relative humidity (RH), air velocity (Vair), 
metabolic rate (Met) and clothing factor (Clo). In the 6-parameter case, 
mean radiant temperature (Tr) is included as the sixth parameter. 


5.1 Pre-processing 


The six parameters are distributed into multiple fuzzy sets using the standard 
Gaussian distribution function (μ), calculated as 

\[ \mu_i^j(x_i) = \exp\left( -\frac{(x_i - a_i^j)^2}{2\,(b_i^j)^2} \right), \]

where $\mu_i^j$ is the membership function for input $x_i$ and the corresponding rule j, 
while $a_i^j$ and $b_i^j$ are the corresponding mean and standard deviation (s.d.) of 
the distribution, respectively, for $x_i$ and j. Figure 1 depicts the initial fuzzy 
distribution step for air-temperature, relative humidity and air velocity. In the 
initial pre-processing step, each of the six parameters is distributed into three 
fuzzy sets as follows. 
Air-temperature is distributed into three fuzzy sets: cold, normal (with higher 
s.d., i.e., flat/spread curve for the two extreme sets), hot as shown in Fig. 1(a). 
For relative humidity, Gaussian functions are used to split it into 3 sets: humid, 
normal, dry as shown in Fig. 1(b). A very low s.d. is applied to the two extreme 
sets, while a flat curve is maintained for the normal one. For air velocity, a 
Gaussian distribution is again used to divide it into three sets: stormy, moderate 
air flow, almost still air (with moderate s.d. in the two extreme sets and high s.d. 
in the median set) as shown in Fig. 1(c). The mean radiant temperature Tr is 
distributed in the same way as Ta. The metabolic rate is divided into three sets: 
slow, moderate, active, with the moderate set given a high variance. Finally, the 
clothing factor is divided into three sets: heavily clothed, normal and minimal 
clothing. The heavily clothed set is given a low variance, while the normal set is 
given a high variance. 

Fig. 1. Initial fuzzy distribution of (a) air-temperature, (b) relative humidity and (c) 
air velocity. 
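
The fuzzification step can be sketched as follows. The Gaussian form is the one given in Sect. 5.1; the means and standard deviations below are hypothetical placeholders for the air-temperature sets, since the initial values are only shown graphically in Fig. 1.

```python
import numpy as np


def gaussian_membership(x, mean, sd):
    """Gaussian membership value exp(-(x - mean)^2 / (2 sd^2))."""
    return np.exp(-((x - mean) ** 2) / (2.0 * sd ** 2))


# Hypothetical parameters for the three air-temperature sets (cold, normal, hot).
TEMP_MEANS = np.array([18.0, 26.0, 34.0])
TEMP_SDS = np.array([3.0, 6.0, 3.0])


def fuzzify_temperature(ta):
    """Return the three membership values of one air-temperature reading."""
    return gaussian_membership(ta, TEMP_MEANS, TEMP_SDS)


print(fuzzify_temperature(24.0))
```

Each input parameter is mapped to three membership values in this way, and these values are what the rule layers operate on.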


5.2 Layer Architecture 


Initially, each of the five parameters is divided into three fuzzy sets, i.e., a total 
of 15 fuzzy sets is obtained. The proposed fuzzy-CNN architecture consists of five 
layers, as depicted in Fig. 2 and discussed here layer-wise. 

Fig. 2. Overall architecture of proposed fuzzy-CNN based model. 


Layer 1: Using the pre-processed fuzzified values, the rule combinations are 
generated. Each rule is a tuple of five values for the 5-parameter case (six values 
for the 6-parameter case), where each value is the membership value of one parameter 
in one particular fuzzy set. Hence, a total of 243 (729) rules are generated for the 
5-parameter (6-parameter) case. One sample rule j is “if $x_1$ is $\mu_1^j(x_1)$, 
$x_2$ is $\mu_2^j(x_2)$, ..., $x_p$ is $\mu_p^j(x_p)$, then output is y”, where the 
$x_i$ are the p input parameters and the $\mu_i^j$ are the corresponding membership 
functions. The combinations are generated as follows: if there are two sets 
$A = [a_1, a_2]$ and $B = [b_1, b_2]$, then their ordered combinations are 
$[(a_1, b_1), (a_1, b_2), (a_2, b_1), (a_2, b_2)]$. 
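
The ordered combinations of Layer 1 correspond to a Cartesian product, which the Python standard library provides directly. The sketch below uses placeholder labels instead of real membership values; the helper name is hypothetical.

```python
from itertools import product


def rule_combinations(fuzzy_sets_per_parameter):
    """Ordered combinations taking one fuzzy-set value from each parameter."""
    return list(product(*fuzzy_sets_per_parameter))


# The two-set example from the text.
pairs = rule_combinations([["a1", "a2"], ["b1", "b2"]])
print(pairs)

# Three fuzzy sets per parameter: 3^5 rules for five parameters, 3^6 for six.
rules5 = rule_combinations([[f"p{i}_s{j}" for j in range(3)] for i in range(5)])
rules6 = rule_combinations([[f"p{i}_s{j}" for j in range(3)] for i in range(6)])
print(len(rules5), len(rules6))
```

With this ordering the last parameter varies fastest, so every three consecutive rules differ only in the fuzzy set of the last parameter.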


Layer 2: The values within each rule are multiplied together in order to obtain 
the rule strength ($w_j$) as discussed in Sect. 2.2. 


Layer 3: Each rule strength value is normalized ($\bar{w}_j$) as given by Eq. 3, where 
j ranges from 1 to 243 (729) for the 5-parameter (6-parameter) case. 

\[ \bar{w}_j = \frac{w_j}{\sum_j w_j} \tag{3} \]


Layer 4: The input data tuple (dimension 5 × 1) is transformed into a 7 × 1 sized 
tuple by appending the average of the five parameters and the integer 1. 
The appended 1 can be regarded as the bias term $a_0$ of Eq. 2. Then, the 
output of layer 3, i.e., the normalized value of each rule (size 1 × 1), is multiplied 
with the transformed tuple to get the expanded rule value of size 7 × 1. 
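
Layers 2 to 4 can be sketched together in NumPy as follows. This is an illustrative reading of the description above, not the authors' code: the function name is hypothetical, the membership values are random stand-ins, and the rules are flattened in the product order of Layer 1.

```python
from itertools import product

import numpy as np


def expand_rules(memberships, x):
    """Layers 2-4 for one sample.

    memberships : (p, 3) membership values of the p inputs in their three fuzzy sets
    x           : (p,) raw input values
    Returns a flat vector of length 3**p * (p + 2).
    """
    # Layer 2: rule strength = product of the membership values in each rule.
    strengths = np.array([np.prod(combo) for combo in product(*memberships)])
    # Layer 3: normalize the strengths so that they sum to one.
    normalized = strengths / strengths.sum()
    # Layer 4: append the mean of the inputs and the constant 1, then scale
    # the augmented tuple by each normalized rule strength.
    augmented = np.concatenate([x, [x.mean(), 1.0]])
    return np.outer(normalized, augmented).ravel()


rng = np.random.default_rng(0)
expanded = expand_rules(rng.random((5, 3)) + 0.1, rng.random(5))
print(expanded.shape)  # 243 rules x 7 values
```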


Layer 5: Deep networks are used to obtain the parameters, i.e. to estimate y of 
Eq. 2. First, a generic 3-layer neural network with the RMSProp optimizer [3] is 
used with a learning rate of around 0.0005 to find y from 1701 (= 243 × 7) 
parameters for the 5-parameter case and 5832 (= 729 × 8) for the 6-parameter case. 
Later on, for the 5-parameter case, it is compared with the deep architecture 
consisting of a 7 × 1 (8 × 1 for the 6-parameter case) convolve layer with one 
channel and a stride of 7 (8 for the 6-parameter case) units, followed by a 3 × 1 
max-pool layer with a stride of 3 units, which is again passed through a convolve 
layer of size 1 × 1 with 3 channels and unit stride, followed by a max-pool layer 
identical to the previous one. This is followed by three fully-connected (FC) layers 
with 500, 250 and 50 neurons, respectively. The entire architecture of layer 5 
for the 5-parameter case is depicted in Fig. 3. 

Fig. 3. Internal architecture of layer 5 of the proposed fuzzy-CNN based model. 


5.3 Choice of Parameters 


The mean radiant temperature (MRT) is related to the air-temperature according 
to the ISO 7726 standard [1]. Considering MRT as one of the input parameters 
leads to a three-fold increase in the number of rules (from 243 to 729) and to 
729 × 8 = 5832 consequent parameters for the 6-parameter case, which makes 
parameter estimation significantly more expensive. The metabolic rate and clothing 
factor were unavoidable, as shown by Yifan et al. [16], whereas air-temperature, 
relative humidity and air-velocity are maintained as the key parameters. 


5.4 Input System 


For experiments and analysis, the RP-884 data are used as reference, where the 
datasets for one NV building by Dear et al. [7] and 22 HVAC buildings by 
Cena et al. [8] are combined into the entire dataset. The former was obtained 
in the wet equatorial climate of Singapore in 1991, and the latter in the hot arid 
region of Kalgoorlie-Boulder, Australia, for both the winter and summer seasons 
of 1998. The Singapore, Australian winter and Australian summer datasets 
have 584, 625 and 589 samples, respectively, totaling 1798 samples, out of which 
around 1400 samples were used for training and 400 for testing, selected randomly 
at runtime. The parameters used in simulation lie in the following ranges: 
air-temperature from 16.7 °C to 36.1 °C, relative humidity from 24.54 % to 
97.82 %, air velocity from 0.043 m/s to 1.567 m/s, mean radiant temperature from 
16.82 °C to 32.81 °C, metabolic rate from 0.772 Met to 2.58 Met (where 1 Met 
= 58 W/m²), and the clothing factor from 0.045 to 1.57. 


6 Deep Layer Functioning 


As discussed in Sect. 2, a convolutional layer performs a dot product and hence 
downsamples its input. In the 5-parameter case, the input dimension 
is 1701 × 1 × 1 × 1 (height × width × depth × channels). This is obtained by 
expanding all 243 rules into 7 × 1 sized tuples as described earlier (243 × 7 = 
1701). After convolving with a 7 × 1 sized filter with a stride of 7 × 1 × 1 
× 1, each expanded rule is reduced to a single value, i.e., to 243 parameters for 
the 5-parameter case. Similarly, in the 6-parameter case the expanded rule values 
are reduced to 729 parameters. Each value can be considered as related to the 
normalized rule strength from layer 3. This reduction resembles going backward 
from layer 4 to layer 3, although it is computed differently. Max-pooling with 
window size 3 × 1 × 1 × 1 downsamples every 3 consecutive values into a single one 
(the maximum). Before normalization in layer 3, every 3 consecutive rule strengths 
in layer 2 differ only in the membership value of the clothing factor (Clo). 
Here, the pooling step takes the maximum (not the average), as this is easier to 
optimize. The data is thereby reduced to 81 parameters for the 5-parameter case 
and 243 parameters for the 6-parameter case. The next convolve layer performs a 
dot product with each value obtained in the last step, but adds 3 channels, making 
it 81 × 3 sized data. Reshaping this data produces a 243 × 1 shape again. The 
following max-pooling reduces it to 81 values for the 5-parameter case (243 values 
for the 6-parameter case), which amounts to removing the effect of the clothing 
factor (Clo). One more max-pooling layer of the same dimension and stride reduces 
it to 27 parameters; this can be considered as neutralizing the human metabolic 
rate (Met). Adding further layers affects the results and the time to train the 
model. The second convolve layer is added to increase the number of parameters to 
optimize and to pass different values to the second pooling layer. The 
fully-connected (FC) layers added afterwards start with almost 20 times as many 
neurons as there are parameters (27 × 19 = 513 ≈ 500). Weights and biases of these 
FC layers are initialized with random normal values. 
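
The sequence of sizes described in this section (1701, 243, 81, 243, 81, 27) can be traced with a small NumPy sketch. It is only a shape check under stated assumptions: the helpers are hypothetical, the weights are random, biases and activations are omitted, and the 1 × 1 convolution with 3 channels is modelled as multiplying each value by three weights before flattening.

```python
import numpy as np


def conv1d_strided(x, kernel, stride):
    """Valid 1-D convolution (dot product per window) with the given stride."""
    n_out = (len(x) - len(kernel)) // stride + 1
    return np.array([x[i * stride:i * stride + len(kernel)] @ kernel
                     for i in range(n_out)])


def max_pool1d(x, size):
    """Non-overlapping max-pooling; len(x) must be a multiple of size."""
    return x.reshape(-1, size).max(axis=1)


rng = np.random.default_rng(0)
x = rng.random(1701)                      # 243 expanded rules x 7 values
shapes = []

h = conv1d_strided(x, rng.standard_normal(7), stride=7)
shapes.append(h.size)                     # one value per rule
h = max_pool1d(h, 3)
shapes.append(h.size)
h = np.outer(h, rng.standard_normal(3)).ravel()   # 1 x 1 conv, 3 channels, flattened
shapes.append(h.size)
h = max_pool1d(h, 3)
shapes.append(h.size)
h = max_pool1d(h, 3)
shapes.append(h.size)

print(shapes)
```

The final vector of 27 values is what the fully-connected layers with 500, 250 and 50 neurons would receive in the 5-parameter case.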

These rules basically boil down to a normalized value (Eq. 2). They are first 
expanded (refer to layer 4 of Sect. 5.2) and then compressed through the CNN 
architecture in layer 5. The last max-pooling layers convert the data from 243 to 
27 values for the 5-parameter case (from 729 to 81 for the 6-parameter case), which 
is a size that the TensorFlow fully-connected nets can handle comfortably. 
Compressing that many rules using the proposed CNN architecture is an important 
step in tackling the combinatorial explosion of fuzzy rules. 


7 Simulation Results 


The proposed method is implemented using Python 3.2, TensorFlow and 
NumPy. Initially, 3 traditional FC neural network layers are appended to layer 
4 of the ANFIS model, which results in a best root-mean-square error (RMSE) 
of around 0.8 on the test dataset. The proposed fuzzy-CNN model is fed 


722 A. Mitra et al. 


with the train and test datasets described in Sect. 5.4. It provides a good 
RMSE value of around 0.018 for the 5-parameter case and around 0.08 for the 
6-parameter case, with no prior knowledge used in either case. The 
ANFIS model with prior knowledge for the 6-parameter case with multivariate 
regression reaches a best RMSE value of 0.04. 

The error plot in Fig. 4(a) shows the variation of actual and predicted PMV 
values for the 5-parameter case for the first 100 test samples. The small size of 
most of the data points indicates the closeness of predicted and actual PMV 
values. Figure 4(b) represents the amount of error in the predicted value for each 
sample for both the 5-parameter case and the 6-parameter case of the proposed model 
(i.e., F-CNN without MRT and F-CNN with MRT, respectively, where MRT 
is mean radiant temperature) and for the ANFIS model with prior knowledge for the 
6-parameter case. The error in the 6-parameter case deviates the most from 0. 
Over 400 samples, the RMSE values are around 0.02 and 0.08 for the 5-parameter 
and the 6-parameter case, respectively, while for the ANFIS model [16] with prior 
knowledge it is around 0.04. The consideration of MRT reduces the accuracy. 
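
For reference, the RMSE used in these comparisons is the square root of the mean squared difference between actual and predicted PMV values. A minimal NumPy sketch (the function name and sample numbers are illustrative only):

```python
import numpy as np


def rmse(actual, predicted):
    """Root-mean-square error between two equally sized sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Made-up PMV values, only to show the call.
print(rmse([0.5, -1.0, 1.2], [0.52, -0.97, 1.18]))
```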

It is also observed that RMSProp converges slowly but reaches a lower minimum 
(RMSE < 0.019) for both the 5-parameter and 6-parameter cases, while the gradient 
descent (GD) optimizer converges quickly but to a higher minimum (RMSE > 0.021). 
Here, the learning rate is kept at 0.0005 and the batch size at 5. Increasing the 
batch size has no significant effect on the minimum reached, except that 
convergence is slower. Figure 5(a) and (b) show how the RMSE converges against 
iterations for the 5-parameter and 6-parameter case, respectively, with both 
optimizers. Figure 6 shows the finally tuned fuzzy set values for the three 
parameters air-temperature, relative humidity and air-velocity, respectively. 

Fig. 4. Variation of (a) predicted and actual PMV values, and (b) relative error of 
three approaches with varying number of samples. 

Fig. 5. Comparison between RMSProp and GD optimizer for (a) 5-parameter case and 
(b) 6-parameter case. 

Fig. 6. Final fuzzy distribution of (a) air-temperature, (b) relative humidity and (c) 
air velocity. 


8 Conclusions 


Predicted mean vote (PMV) is a widely used thermal comfort index. In this 
paper, we have proposed a novel method based on a fuzzy convolutional neural 
network (F-CNN) model to estimate the PMV values. The proposed model 
outperforms the existing model for PMV estimation with a lower root mean square 
error value. It is found that CNN can efficiently deduce the inter-dependencies of 
the parameters and their impact in estimating the final PMV values. In future, 
this work can be extended by incorporating selective networks like restricted 
Boltzmann machines or belief networks in order to obtain better accuracy in 
PMV value estimation. 

Abstract. Due to the war in Syria and to instability and poverty in wide regions 
of the world, immigration flows to Europe have increased very significantly. Among 
the EU countries, Greece and Italy bear the heaviest load due to their geographical 
location. This research paper proposes a flexible and rational Soft Computing 
approach, aiming to model and classify areas of the Greek (sea and land) borderline, 
based on the density and range of illegal immigration (ILIM). The proposed model 
employs Intuitionistic Fuzzy Sets (FUS) and Fuzzy Similarity indices (FUSI). The 
application of this methodology can provide significant aid towards the assessment 
of the situation in each of the involved areas, depending on the extent of the flow 
they face. 


Keywords: Illegal immigration · Intuitionistic fuzzy sets · 
Degrees of membership · Degrees of non-membership · Similarity indices · 
Classification 


1 Introduction 


To the best of our knowledge, this is the first Soft Computing approach employing 
Intuitionistic Fuzzy Sets for illegal immigration risk modeling for Greece. This 
was achieved by employing Fuzzy Algebraic approaches, which offer a flexible and 
effective solution for the representation and modeling of real world concepts 
(e.g. “high temperature”, “small rain height”, “high altitude”). From this point 
of view this research effort has a certain level of innovation. 

Fuzzy Logic constitutes a part of Soft Computing, a branch of Artificial Intelligence 
that is used in many scientific applications, like control systems and Hybrid Decision 
Support systems. It is widely used in risk estimation. However, the most important 
innovative element of this research is the introduction of a new risk estimation 
approach, employing Intuitionistic Fuzzy Sets in order to enhance flexibility. This is 
really important for totally unstructured problems like the one faced herein. 

This research proposes several annual local ILIM risk models for a period of eight 
years (2010-2017). These models were compared to each other and analyzed 
thoroughly. The result was the estimation of cross-checked indices regarding annual 
illegal immigration risk (ANIIR) similarities and differences among the areas of 
entry. The areas considered are the following: the Greek - Albanian border, the 
Greek - FYROM border, the Greek - Bulgarian border, river Evros (natural border 
between Greece and Turkey), the islands of Lesvos, Samos and Crete, the islands of 
the Dodecanese and Cyclades, and finally the rest of the country. The classification 
of all areas based on the produced metadata was performed through an innovative and 
flexible algorithmic approach. More attention was given to the area of river Evros 
and to the island of Lesvos, as they are the major entry points, carrying a major 
part of the illegal immigration flow. 


2 Materials and Methods 
2.1 Theoretical Background — Methodology (Hung and Yang 2004) 
According to Atanassov (1999), an IFUS A is defined as follows: 

\[ A = \{ \langle x, \mu_A(x), \nu_A(x) \rangle \mid x \in X \} \tag{2.1} \]

where $\mu_A(x), \nu_A(x) \in [0, 1]$ denote the degree of membership and the degree 
of non-membership of $x \in A$ respectively. 
The following condition must be met: 

\[ 0 \le \mu_A(x) + \nu_A(x) \le 1 \quad \forall x \in X \tag{2.2} \]


For each IFUS A in the universe of discourse X, 

\[ \pi_A(x) = 1 - \mu_A(x) - \nu_A(x) \tag{2.3, 2.4} \]

is called the hesitancy degree of x to A and it satisfies the inequality 
$0 \le \pi_A(x) \le 1 \ \forall x \in X$. 
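
As a small worked example of the definitions above, the hesitancy degree follows directly from the membership and non-membership degrees once condition (2.2) is checked. The function name and the sample values are illustrative assumptions.

```python
def hesitancy(mu: float, nu: float) -> float:
    """Hesitancy degree pi = 1 - mu - nu of an intuitionistic fuzzy element."""
    if not (0.0 <= mu <= 1.0 and 0.0 <= nu <= 1.0):
        raise ValueError("mu and nu must lie in [0, 1]")
    if mu + nu > 1.0:
        raise ValueError("mu + nu must not exceed 1 (condition 2.2)")
    return 1.0 - mu - nu


# Made-up degrees: membership 0.6, non-membership 0.3.
print(round(hesitancy(0.6, 0.3), 3))
```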
The following membership function has been used: 


o? 


u(x) = 0.76 22 (2.5) 


A very important aspect of IFUS is the estimation of their degree of similarity 
(DESI). The DESI S between two IFUS A and B can be calculated with various 
functions such as: 


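• (Li and Cheng 2002): 

\[ S_d^p(A, B) = 1 - \frac{1}{\sqrt[p]{n}} \sqrt[p]{\sum_{i=1}^{n} \left| m_A(i) - m_B(i) \right|^p} \tag{2.6} \]

where 

\[ m_A(i) = \frac{\mu_A(x_i) + 1 - \nu_A(x_i)}{2} \tag{2.7} \]

and $1 \le p < \infty$. 

• (Liang and Shi 2003): 

\[ S_e^p(A, B) = 1 - \frac{1}{\sqrt[p]{n}} \sqrt[p]{\sum_{i=1}^{n} \left( \varphi_{tAB}(i) + \varphi_{fAB}(i) \right)^p} \tag{2.8} \]

where 

\[ \varphi_{tAB}(i) = \left| \mu_A(x_i) - \mu_B(x_i) \right| / 2 \tag{2.9} \]

and 

\[ \varphi_{fAB}(i) = \left| (1 - \nu_A(x_i))/2 - (1 - \nu_B(x_i))/2 \right| \tag{2.10} \]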

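The degree of similarity S for all IFUS A, B and C satisfies the following 
properties: 

\[ 0 \le S(A, B) \le 1 \tag{2.11} \]

\[ S(A, B) = 1 \ \text{if} \ A = B \tag{2.12} \]

\[ S(A, B) = S(B, A) \tag{2.13} \]

\[ S(A, C) \le S(A, B) \ \text{and} \ S(A, C) \le S(B, C) \ \text{if} \ A \subseteq B \subseteq C \tag{2.14} \]

A stronger version of property (2.12) is (2.15) (Mitchell 2003): 

\[ S(A, B) = 1 \ \text{if and only if} \ A = B \tag{2.15} \]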


2.2 Data 


The data used in this research were obtained from the official website of the Hellenic 
Police: http://www.astynomia.gr/newsite.php?&lang. 


3 The Proposed Fuzzy Intuitionistic System 


The proposed model estimates the degree of membership of each area to the Linguistics 
“Low Risk”, “Moderate Risk”, “High Risk” separately, based on the number of 
incidents. It is an indirect multiclass classification. The intuitionistic fuzzy sets 
(INFS) are also used to estimate the degree of similarity between the most risky 
areas. 

Initially, the raw data were stored in MS Excel. The global range of the problem 
was found by obtaining the minimum and the maximum number of ILIM for the whole 
country and for every year. 


Screen 1. Fuzzy toolbox of MATLAB 


Then, the raw data were transferred into proper tables in MATLAB, and three 
membership functions corresponding to the three fuzzy Linguistics mentioned above 
were defined automatically by the Fuzzy Toolbox of MATLAB after the input of 
the range (minimum and maximum values). For all three Linguistics, Gaussian 
membership functions were employed. Screens 1 and 2 show the graphical user 
interface of the Fuzzy Toolbox of MATLAB. 


Screen 2. Gaussian membership function 
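
The membership step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three Gaussian functions are centered at the minimum, the midpoint and the maximum of the range, and the width `sigma_fraction` is a hypothetical parameter, since the text does not state the value chosen by the toolbox. The function names are likewise hypothetical.

```python
import math


def gaussian_mf(x, center, sigma):
    """Gaussian membership function with the given center and width."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def risk_memberships(incidents, lo, hi, sigma_fraction=0.17):
    """Degrees of membership to Low, Moderate and High Risk.

    lo and hi are the global minimum and maximum number of incidents.
    sigma_fraction (width as a fraction of the range) is an assumed value.
    """
    span = hi - lo
    sigma = sigma_fraction * span
    centers = {"low": lo, "moderate": lo + span / 2.0, "high": hi}
    return {name: gaussian_mf(incidents, c, sigma) for name, c in centers.items()}


# Hypothetical range of 0 to 1000 incidents
example = risk_memberships(0, 0, 1000)
```

An area with the minimum number of incidents receives a degree of membership of 1 to Low Risk and degrees close to zero for the other two Linguistics.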
4 Results and Discussion


4.1 Comparison Between the River Evros Area and Island of Lesvos 


The river Evros area was chosen because it is the main natural land border
between Greece and Turkey, and the island of Lesvos because it is a characteristic
destination that can easily be reached by sea from the Turkish coast using small boats.

The following chart resulted from the analysis of fuzzy sets and the use of
intuitionistic fuzzy sets, and it presents the degree of similarity between the two areas.


Graph 1. The characteristic categories of risk according to the number of illegal immigrants for
Evros and Lesvos (2010-2017) (0 low risk, 0.5 moderate risk, 1 high risk). [Line chart; horizontal
axis: years 2010 to 2017; series: Evros, Lesvos.]


As Graph 1 shows, in 2016 the risk for Lesvos started dropping (possibly because
the refugee camps of Lesvos had no remaining capacity and the refugees had
realized this). At the same time the risk for Evros started rising, as people were
looking for a new entry point to Greece that could lead to Athens or to western Europe.
Most striking is the fact that the two lines almost met in 2017, so that the two
areas have a high DOS.


Comparison 1 

According to the number of ILIM incidents, for the years 2010-2012 the two
areas do not have the same behavior or the same risk. Evros is a case of “High Risk”,
whereas Lesvos is a case of “Low Risk” for these years. Their degree of similarity
(DOS) is close to 1 only for the Moderate linguistic; all other degrees are almost equal to zero.


Table 1. Risk similarity for 2010-2012 


Degrees of risk similarity (Lesvos-Evros) | 2010     | 2011     | 2012
Low                                       | 0.017103 | 9.55E-05 | 0.024268
Moderate                                  | 0.977549 | 0.998743 | 0.971531
High                                      | 2.28E-07 | 2.03E-07 | 8.95E-07


As can be seen in Table 1, the degree of similarity is not high.


Comparison 2

For the year 2013, the two areas have almost the same behavior for all risk
Linguistics, as they have a very high degree of similarity and are both “Low
Risk”. However, Evros is 7.7 times closer to the category of “Moderate Risk”. We
examine the year 2013 separately because, as we will see below, 2013 is the start of a
new era (Table 2).


Table 2. Risk similarities for the year 2013 


Degrees of risk similarity (Lesvos-Evros) | 2013
Low                                       | 0.583859885
Moderate                                  | 0.662525807
High                                      | 0.999935607


Comparison 3 

For the year 2014, the two areas do not have the same behavior or the same risk
according to the number of illegal immigrants. Evros is a typical case of “Low Risk”
and Lesvos is a case of “Moderate Risk”. This is because (as Graph 1 also shows)
the risk level for Lesvos had started rising sharply, whereas the risk for
Evros had been dropping for several years in a row (Table 3).


Table 3. Risk similarities for the year 2014 


Degrees of risk similarity | 2014
Low                        | 0.094242768
Moderate                   | 0.373050006
High                       | 0.763732614


Comparison 4 

If we compare the two areas for 2015 and 2016, we see that they do not have the same
behavior or the same risk according to the number of illegal immigrants. Evros is still
a typical case of “Low Risk” and Lesvos is a characteristic case of “High Risk”. The
refugees keep preferring the sea route and keep risking their lives by using small
rotten boats to reach the islands, mainly the island of Lesvos, which is very
close to the Turkish mainland and is also a major island. The 12 km fence that was
built by Greece on the northern land border with Turkey has played a significant role in
this situation (Table 4).


Table 4. Comparison of risk similarities for 2015-2016 


Degrees of risk similarity | 2015        | 2016
Low                        | 0.000351104 | 0.009087607
Moderate                   | 0.997457414 | 0.984788473
High                       | 1.04915E-07 | 2.75765E-07


Comparison 5

For the year 2017, the two areas have almost the same behavior for all risk
Linguistics, as they have a high degree of similarity and are both “Moderate Risk”.
However, Lesvos is 1560 times closer to the category of “Moderate Risk” (Table 5).


Table 5. Comparison of risk similarities for the year 2017 


Degrees of risk similarity | 2017
Low                        | 0.541804804
Moderate                   | 0.678369381
High                       | 0.823298925


5 Conclusion-Discussion 


After analyzing the results, we see that the degree of risk of the two pilot areas (Lesvos
and Evros) changes on an annual basis and ranges from high to low, depending on
the orientation of the immigrants’ flow. This shows that the desperate immigrants who left
their countries to escape from war are sometimes motivated to use land borders and
sometimes to cross the sea. This depends on the information they get and on the
interests of the people who exploit this unfortunate situation in order to make a profit.
Another measure that motivated immigrants to cross the sea to the islands after 2013 is
the construction of the fence at the river Evros on the Greek side. The construction of
this fence finished in 2014. In Fig. 4.1 we see that from 2013 (when the works for the
fence had started) until 2016 the ILIM incidents at the land border (Evros)
dropped to an extremely low level, whereas the incidents in Lesvos increased
dramatically. In 2017 the tendency of the refugees was to use both routes equally. The
research has shown that the immigration of the refugees is not random but
organized, and the desperate ILIM follow specific routes that change annually,
depending on the policies of the involved countries.


Table 6. Comparison of degrees of immigration risk for Greece for 2010 and 2011


Areas          | 2010 L | 2010 M | 2010 H  | 2011 L | 2011 M | 2011 H
Albanian       | 1E-04  | 0.432  | 0.25739 | 0.4557 | 0.24   | 2E-05
FYROM          | 0.986  | 0.021  | 7.8E-08 | 0.9947 | 0.0177 | 5E-08
Bulgarian      | 0.996  | 0.017  | 5E-08   | 0.998  | 0.0158 | 4E-08
Evros          | 3E-08  | 0.013  | 1       | 3E-08  | 0.0132 | 1
Lesvos         | 0.974  | 0.025  | 1.1E-07 | 0.9999 | 0.0138 | 3E-08
Samos          | 0.991  | 0.019  | 6.6E-08 | 0.9996 | 0.0143 | 4E-08
Chios          | 1      | 0.014  | 3.4E-08 | 1      | 0.0133 | 3E-08
Dodecanese     | 0.988  | 0.021  | 7.5E-08 | 1      | 0.0132 | 3E-08
Cyclades       | 1      | 0.013  | 3E-08   | 0.9999 | 0.0137 | 3E-08
Crete          | 0.964  | 0.028  | 1.4E-07 | 0.9854 | 0.0215 | 8E-08
Rest of Greece | 2E-06  | 0.099  | 0.73042 | 0.0094 | 0.9937 | 0.0182

Tables 6, 7, 8 and 9 present the fuzzy degrees of membership of all
border areas of Greece to the Linguistics Low, Medium and High ILIM Risk.

Evros is a typical case of High Risk with a Degree of Membership (DOM) equal to
1 for both 2010 and 2011, whereas Lesvos is Low Risk with DOM practically equal to
1 (0.99). The other areas are all Low Risk compared to these two spots.


Table 7. Comparison of degrees of immigration risk for Greece for 2012 and 2013 


Areas          | 2012 L | 2012 M | 2012 H  | 2013 L | 2013 M | 2013 H
Albanian       | 0.108  | 0.708  | 0.0008  | 0.0009 | 0.7191 | 0.1028
FYROM          | 0.975  | 0.025  | 1.1E-07 | 0.9447 | 0.0336 | 2E-07
Bulgarian      | 0.998  | 0.016  | 4.4E-08 | 0.99   | 0.0198 | 7E-08
Evros          | 3E-08  | 0.013  | 1       | 0.9349 | 0.0363 | 2E-07
Lesvos         | 0.964  | 0.028  | 1.4E-07 | 0.4069 | 0.2774 | 3E-05
Samos          | 0.979  | 0.024  | 9.7E-08 | 0.5241 | 0.1959 | 1E-05
Chios          | 1      | 0.014  | 3.2E-08 | 0.8705 | 0.054  | 6E-07
Dodecanese     | 0.98   | 0.023  | 9.4E-08 | 0.7049 | 0.1088 | 3E-06
Cyclades       | 1      | 0.013  | 3E-08   | 1      | 0.0132 | 3E-08
Crete          | 0.862  | 0.057  | 6.4E-07 | 0.6728 | 0.1218 | 4E-06
Rest of Greece | 7E-07  | 0.058  | 0.85442 | 3E-08  | 0.0131 | 1


Evros is still a High Risk area in 2012, and then its risk suddenly drops to zero (Low
Risk area with the maximum DOM) in 2013, whereas the risk for Lesvos
starts rising and the rest of the country becomes High Risk with DOM equal to 1.


Table 8. Comparison of degrees of immigration risk for Greece for 2014 and 2015 


Areas          | 2014 L | 2014 M | 2014 H  | 2015 L | 2015 M | 2015 H
Albanian       | 0.013  | 1      | 0.0138  | 0.9958 | 0.0172 | 5E-08
FYROM          | 0.952  | 0.032  | 1.8E-07 | 1      | 0.0134 | 3E-08
Bulgarian      | 0.986  | 0.021  | 8E-08   | 1      | 0.0133 | 3E-08
Evros          | 0.863  | 0.056  | 6.3E-07 | 0.9995 | 0.0145 | 4E-08
Lesvos         | 5E-04  | 0.627  | 0.14156 | 3E-08  | 0.0132 | 1
Samos          | 0.054  | 0.871  | 0.00243 | 0.4916 | 0.2159 | 2E-05
Chios          | 0.122  | 0.672  | 0.00064 | 0.3873 | 0.2937 | 4E-05
Dodecanese     | 2E-06  | 0.083  | 0.77688 | 0.3233 | 0.3545 | 7E-05
Cyclades       | 1      | 0.013  | 3E-08   | 1      | 0.0132 | 3E-08
Crete          | 0.65   | 0.132  | 4.6E-06 | 0.9997 | 0.0142 | 3E-08
Rest of Greece | 3E-08  | 0.013  | 1       | 0.9616 | 0.0289 | 2E-07
In 2014 the problem is in recession (no area is High Risk), whereas in 2015, all of a
sudden, only Lesvos is assigned a DOM equal to 1 (extreme value) for the Linguistic
High Risk.

It is remarkable that in 2016 only Lesvos still remains High Risk, with the
maximum DOM equal to 1. The situation changes completely in 2017, when many
areas are assigned the Moderate Risk linguistic. These areas are Lesvos, Samos, Chios,
Evros and, surprisingly, the Greek-Albanian border. It should be mentioned that
Lesvos, though Moderate Risk, is still the riskiest area, followed by the Albanian
border and the island of Chios.


Table 9. Comparison of degrees of immigration risk for Greece for 2016 and 2017 


Areas          | 2016 L | 2016 M | 2016 H  | 2017 L | 2017 M | 2017 H
Albanian       | 0.95   | 0.032  | 1.9E-07 | 0.1284 | 0.6568 | 0.0006
FYROM          | 1      | 0.014  | 3.3E-08 | 0.9996 | 0.0142 | 4E-08
Bulgarian      | 0.999  | 0.015  | 4.1E-08 | 0.9632 | 0.0284 | 1E-07
Evros          | 0.986  | 0.021  | 7.8E-08 | 0.3269 | 0.3506 | 7E-05
Lesvos         | 3E-08  | 0.013  | 1       | 0.0009 | 0.7246 | 0.1012
Samos          | 0.68   | 0.119  | 3.6E-06 | 0.3183 | 0.3597 | 7E-05
Chios          | 0.053  | 0.875  | 0.00252 | 0.1974 | 0.5211 | 0.0002
Dodecanese     | 0.506  | 0.207  | 1.5E-05 | 0.5321 | 0.1909 | 1E-05
Cyclades       | 1      | 0.013  | 3E-08   | 1      | 0.0132 | 3E-08
Crete          | 0.998  | 0.016  | 4.5E-08 | 0.9182 | 0.0407 | 3E-07
Rest of Greece | 0.658  | 0.128  | 4.3E-06 | 3E-08  | 0.0132 | 1


All of the above conclusions are very interesting, and they show the routes that the
immigrants decide to follow every year. It is impressive that the vast majority of them
are guided to follow specific routes that change depending on the situation. These data
and results will be much more interesting to people who have a clear view of the
situation and can explain the route changes on an annual basis. Of course, the living
conditions of the immigrants must be improved.

Future research will focus on the estimation of similarities among all border areas 
of Greece. 
Abstract. A basic building block in the foundation of fuzzy neural networks is
the theory of fuzzy implications, which play a crucial role in this
topic. The aim of this paper is to find a new method of generating fuzzy
implications based on a given fuzzy negation. Specifically, we propose using a
given fuzzy negation and a function to generate rules of fuzzy implications,
that is, rules which regulate decision making, thus adapting mathematics to
human common sense. A great advantage of this construction is that the
implications generated in this way fulfil many of the required axioms and
important properties.


Keywords: Fuzzy implication · Fuzzy negation · t-norm · t-conorm


1 Introduction 


Everything starts from the well-known connective «if ... then ...», which is
called implication in mathematics. By filling the gaps in this scheme with
phrases, we obtain a hypothesis which, in classical logic, is either true or false, so its value
is 1 or 0, respectively. Fuzzy logic deals not only with the values 0 and 1 but
also explores these implications when their values lie between 0 and 1.

Basically, a fuzzy system is in essence a system of linguistic rules of the form “if 
.... then ....”, which match two fuzzy linguistic concepts A and B according to natural 
language and common sense, as in the following examples 


“If someone is tall, then (s) he is also heavy” 
or 
“If it snows heavily, then the road gets dangerous”. 


In other words, through implications and fuzzy operations, fuzzy systems enable 
“engines” and mathematics to incorporate the way of expression of everyday language 
and common sense. Note here that both mathematics and engines function according to 
Boolean algebra. 

In this paper our goal is to introduce a way of generating fuzzy implications through
fuzzy negations, as has been described in the literature.

A fuzzy implication is a generalization of the classical implication, in the same way
that a t-norm and a t-conorm are generalizations of the classical conjunction and
disjunction, respectively. In the rest of the paper we will introduce the most fundamental
properties of fuzzy conjunctions and examine the relations that connect negations and
conjunctions in fuzzy logic theory.

To find out whether a fuzzy negation can generate fuzzy implications, we need
to ensure that the effect of the negation does not violate the axioms of fuzzy
implications. For this reason, it is necessary to briefly mention the basic axioms and
properties of fuzzy implications [1].


2 Theoretical Background 


In this paper, as the definition of a fuzzy implication we will use the one proposed
by Kitainik [2] and by Fodor and Roubens [3].


2.1 Fuzzy Implication 


Definition 1. A function $I: [0,1] \times [0,1] \to [0,1]$ is called a fuzzy implication if for
all $x, x_1, x_2, y, y_1, y_2 \in [0,1]$ the following conditions are satisfied:

(I1) if $x_1 \le x_2$ then $I(x_1, y) \ge I(x_2, y)$, i.e., $I(\cdot, y)$ is decreasing,

(I2) if $y_1 \le y_2$ then $I(x, y_1) \le I(x, y_2)$, i.e., $I(x, \cdot)$ is increasing,

(I3) $I(0,0) = 1$,

(I4) $I(1,1) = 1$,

(I5) $I(1,0) = 0$.


The set of all fuzzy implications will be denoted by FI.
Examples of fuzzy implications are given in Table 1 below:


Table 1. Examples for fuzzy implications. 


Name          | Formula
Łukasiewicz   | $I_{LK}(x,y) = \min\{1, 1 - x + y\}$
Gödel         | $I_{GD}(x,y) = 1$ if $x \le y$; $y$ if $x > y$
Reichenbach   | $I_{RC}(x,y) = 1 - x + xy$
Kleene-Dienes | $I_{KD}(x,y) = \max(1 - x, y)$
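
The conditions of Definition 1 can be checked numerically for the four implications of Table 1. The following Python sketch does so on a finite grid, so a positive answer is only evidence on that grid and not a proof; the function names, the grid size and the tolerance are illustrative assumptions.

```python
def lukasiewicz(x, y):
    return min(1.0, 1.0 - x + y)


def goedel(x, y):
    return 1.0 if x <= y else y


def reichenbach(x, y):
    return 1.0 - x + x * y


def kleene_dienes(x, y):
    return max(1.0 - x, y)


def is_fuzzy_implication(I, steps=10, tol=1e-12):
    """Check conditions (I1)-(I5) of Definition 1 on a uniform grid."""
    grid = [i / steps for i in range(steps + 1)]
    pairs = list(zip(grid, grid[1:]))  # consecutive points a < b
    i1 = all(I(a, y) >= I(b, y) - tol for a, b in pairs for y in grid)
    i2 = all(I(x, a) <= I(x, b) + tol for a, b in pairs for x in grid)
    i3 = abs(I(0.0, 0.0) - 1.0) < tol
    i4 = abs(I(1.0, 1.0) - 1.0) < tol
    i5 = abs(I(1.0, 0.0)) < tol
    return i1 and i2 and i3 and i4 and i5


checks = {f.__name__: is_fuzzy_implication(f)
          for f in (lukasiewicz, goedel, reichenbach, kleene_dienes)}
```

A function such as the product `x * y` fails the check, because it violates (I3).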


2.2 Basic Properties of Fuzzy Implications 


Additional properties of fuzzy implications have been published in many works (see 
Trillas and Valverde [6], Dubois and Prade [7], Smets and Magrez [8], Fodor and 
Roubens [3], Gottwald [4]). The most important of them are presented below [1]. 


Definition 2. A fuzzy implication $I$ is said to satisfy

— the left neutrality property, if
$I(1, y) = y$, $y \in [0,1]$ (NP)
— the exchange principle, if
$I(x, I(y, z)) = I(y, I(x, z))$, $x, y, z \in [0,1]$ (EP)
— the identity principle, if
$I(x, x) = 1$, $x \in [0,1]$ (IP)
— the ordering property, if
$I(x, y) = 1 \Leftrightarrow x \le y$, $x, y \in [0,1]$ (OP).
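
To make the four properties of Definition 2 concrete, the sketch below tests them on a small grid for the implications of Table 1. It is an illustration under assumptions: the grid of quarter steps, the tolerance and all names are hypothetical, and a property reported as satisfied is only confirmed on the grid points.

```python
from itertools import product

IMPLICATIONS = {
    "Lukasiewicz": lambda x, y: min(1.0, 1.0 - x + y),
    "Goedel": lambda x, y: 1.0 if x <= y else y,
    "Reichenbach": lambda x, y: 1.0 - x + x * y,
    "Kleene-Dienes": lambda x, y: max(1.0 - x, y),
}


def check_properties(I, grid, tol=1e-12):
    """Return which of (NP), (EP), (IP), (OP) hold on the given grid."""
    np_ok = all(abs(I(1.0, y) - y) < tol for y in grid)
    ep_ok = all(abs(I(x, I(y, z)) - I(y, I(x, z))) < tol
                for x, y, z in product(grid, repeat=3))
    ip_ok = all(abs(I(x, x) - 1.0) < tol for x in grid)
    op_ok = all((abs(I(x, y) - 1.0) < tol) == (x <= y)
                for x, y in product(grid, repeat=2))
    return {"NP": np_ok, "EP": ep_ok, "IP": ip_ok, "OP": op_ok}


GRID = [i / 4 for i in range(5)]  # 0, 0.25, 0.5, 0.75, 1
RESULTS = {name: check_properties(I, GRID) for name, I in IMPLICATIONS.items()}
```

On this grid the Łukasiewicz and Gödel implications pass all four checks, while the Reichenbach and Kleene-Dienes implications fail (IP) and (OP), for instance at x = y = 0.5.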


2.3 Fuzzy Negation 


The fuzzy implication and fuzzy negation must be defined together. 

A fuzzy negation N is a generalization of the classical complement or negation ¬,
whose truth table consists of the two conditions ¬1 = 0 and ¬0 = 1. The
following definitions can be found in any introductory textbook on fuzzy logic (see
Fodor and Roubens [3], Klir and Yuan [4], Nguyen and Walker [5]).


Definition 3. A function $N: [0,1] \to [0,1]$ is called a fuzzy negation if

$N(0) = 1$, $N(1) = 0$, (N1)
$N$ is decreasing. (N2)

Definition 4
— A fuzzy negation $N$ is called strict if, in addition,
$N$ is strictly decreasing, (N3)
$N$ is continuous. (N4)
— A fuzzy negation $N$ is called strong if the following property is met:
$N(N(x)) = x$, $x \in [0,1]$. (N5)

In this paper the strong negation will be denoted by
$N_s(x)$, $x \in [0,1]$.


Examples for Fuzzy Negations are given in the Table 2 below: 
Table 2. Examples of fuzzy negations with their properties.


Name         | Formula                                                               | Properties
-            | $N_K(x) = 1 - x^2$                                                    | N1 to N4, strict
-            | $N_R(x) = 1 - \sqrt{x}$                                               | N1 to N4, strict
Sugeno class | $N^{\lambda}(x) = \frac{1 - x}{1 + \lambda x}$, $\lambda \in (-1, +\infty)$ | N1 to N5, strong
Yager class  | $N^{w}(x) = (1 - x^w)^{1/w}$, $w \in (0, +\infty)$                    | N1 to N5, strong
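
The difference between strict and strong negations can be illustrated numerically. The sketch below assumes the usual forms of the negations in Table 2 (1 - x^2, 1 - sqrt(x), the Sugeno class and the Yager class); the helper names, the grid and the tolerance are hypothetical choices for illustration.

```python
def n_k(x):
    return 1.0 - x ** 2


def n_r(x):
    return 1.0 - x ** 0.5


def sugeno(lam):
    """Sugeno class negation with parameter lam in (-1, +inf)."""
    return lambda x: (1.0 - x) / (1.0 + lam * x)


def yager(w):
    """Yager class negation with parameter w in (0, +inf)."""
    return lambda x: (1.0 - x ** w) ** (1.0 / w)


GRID = [i / 20 for i in range(21)]


def is_negation(N, tol=1e-12):
    """Boundary conditions (N1) and monotonicity (N2) on the grid."""
    boundary = abs(N(0.0) - 1.0) < tol and abs(N(1.0)) < tol
    decreasing = all(N(a) >= N(b) - tol for a, b in zip(GRID, GRID[1:]))
    return boundary and decreasing


def is_strong(N, tol=1e-9):
    """Involution property (N5): N(N(x)) = x on the grid."""
    return all(abs(N(N(x)) - x) < tol for x in GRID)


sugeno_2 = sugeno(2.0)
yager_2 = yager(2.0)
```

For example, `is_strong(n_k)` is false because N(N(0.5)) = 0.4375, whereas the Sugeno and Yager negations with parameter 2 return every grid point to itself.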


2.4 Law of Contraposition 


One of the most important tautologies in classical logic is the law of contraposition:

$p \to q \equiv \neg q \to \neg p$

$\neg p \to q \equiv \neg q \to p$

$p \to \neg q \equiv q \to \neg p$

Definition 5. Let $I \in FI$ and $N$ be a fuzzy negation. $I$ is said to satisfy the
— law of contraposition with respect to $N$, if
$I(x, y) = I(N(y), N(x))$, $x, y \in [0,1]$ (CP)
— law of left contraposition with respect to $N$, if
$I(N(x), y) = I(N(y), x)$, $x, y \in [0,1]$ (L-CP)
— law of right contraposition with respect to $N$, if
$I(x, N(y)) = I(y, N(x))$, $x, y \in [0,1]$ (R-CP)


If $I$ satisfies the law of contraposition with respect to $N$, then we denote
this by CP(N).

If $I$ satisfies the left or right law of contraposition with respect to $N$, then we denote
this by L-CP(N) or R-CP(N), respectively [1].


2.5 Natural Negations of Fuzzy Implications 


Lemma 1. If a function $I: [0,1]^2 \to [0,1]$ satisfies (I1), (I3) and (I5), then the
function $N_I: [0,1] \to [0,1]$ defined by


$N_I(x) = I(x, 0)$, $x \in [0,1]$ (1)


is a fuzzy negation. Proof [1].
Let $I \in FI$. The function $N_I$ defined by (1) is called the natural negation of $I$.


3 New Results


In this section, we will provide the definition of new generated implications and prove 
some propositions using definitions introduced in the previous sections. 


3.1 Production of Fuzzy Implications Through Fuzzy Negations 


Definition 7. Let $f: [0,1] \to [0,1]$ be a strictly decreasing and continuous function
with $f(1) = 0$, $f(0) = 1$ and $N$ a fuzzy negation. The function $I: [0,1]^2 \to [0,1]$
defined by


$I(x, y) = f^{-1}(f(N(x)) \cdot f(y))$, $x, y \in [0,1]$ (2)

is called an f-generated implication and is denoted $I_f$.

Proposition 1. If $f$ is an f-generator and $N$ is a fuzzy negation, then $I_f \in FI$.


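Proof. Firstly, since for every $x, y \in [0,1]$ we have $0 \le f(N(x)) \cdot f(y) \le 1$, we see that
formula (2) is correctly defined.


• Since $f$ is strictly decreasing, so is $f^{-1}$, and for any $y \in [0,1]$,

$x_1 \le x_2 \Rightarrow N(x_1) \ge N(x_2) \Rightarrow f(N(x_1)) \le f(N(x_2))$
$\Rightarrow f(N(x_1)) \cdot f(y) \le f(N(x_2)) \cdot f(y)$
$\Rightarrow f^{-1}(f(N(x_1)) \cdot f(y)) \ge f^{-1}(f(N(x_2)) \cdot f(y))$
$\Rightarrow I_f(x_1, y) \ge I_f(x_2, y)$, i.e., $I_f$ satisfies (I1).


• Once again, for any $x \in [0,1]$, we have

$y_1 \le y_2 \Rightarrow f(y_1) \ge f(y_2) \Rightarrow f(N(x)) \cdot f(y_1) \ge f(N(x)) \cdot f(y_2)$
$\Rightarrow f^{-1}(f(N(x)) \cdot f(y_1)) \le f^{-1}(f(N(x)) \cdot f(y_2))$
$\Rightarrow I_f(x, y_1) \le I_f(x, y_2)$, i.e., $I_f$ satisfies (I2).

• $I_f(0,0) = f^{-1}(f(N(0)) \cdot f(0)) = f^{-1}(f(1) \cdot 1) = f^{-1}(0) = 1$, i.e., $I_f$ satisfies (I3).

• $I_f(1,1) = f^{-1}(f(N(1)) \cdot f(1)) = f^{-1}(f(0) \cdot 0) = f^{-1}(0) = 1$, i.e., $I_f$ satisfies (I4).

• $I_f(1,0) = f^{-1}(f(N(1)) \cdot f(0)) = f^{-1}(f(0) \cdot 1) = f^{-1}(1) = 0$, i.e., $I_f$ satisfies (I5).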
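
The construction of Definition 7 can be sketched in Python as follows, reading formula (2) as the product of f(N(x)) and f(y) passed through the inverse of f. The generators and the negation used here are illustrative assumptions, not choices made in the paper, and the axiom check is performed only on a finite grid.

```python
def f_generated(f, f_inv, N):
    """Build I_f(x, y) = f_inv(f(N(x)) * f(y)) from a generator f and a negation N."""
    def implication(x, y):
        return f_inv(f(N(x)) * f(y))
    return implication


def satisfies_axioms(I, steps=10, tol=1e-9):
    """Numerically check (I1)-(I5) on a uniform grid."""
    grid = [i / steps for i in range(steps + 1)]
    pairs = list(zip(grid, grid[1:]))
    decreasing_first = all(I(a, y) >= I(b, y) - tol for a, b in pairs for y in grid)
    increasing_second = all(I(x, a) <= I(x, b) + tol for a, b in pairs for x in grid)
    boundary = (abs(I(0.0, 0.0) - 1.0) < tol
                and abs(I(1.0, 1.0) - 1.0) < tol
                and abs(I(1.0, 0.0)) < tol)
    return decreasing_first and increasing_second and boundary


standard_negation = lambda x: 1.0 - x

# Assumed generator 1: f(x) = 1 - x, which is its own inverse
linear = f_generated(lambda x: 1.0 - x, lambda t: 1.0 - t, standard_negation)

# Assumed generator 2: f(x) = (1 - x)^2 with inverse 1 - sqrt(t)
quadratic = f_generated(lambda x: (1.0 - x) ** 2,
                        lambda t: 1.0 - t ** 0.5,
                        standard_negation)
```

With f(x) = 1 - x and N(x) = 1 - x, formula (2) reduces to 1 - x + xy, which is the Reichenbach implication of Table 1.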


3.2 Natural Negations and f-generator


Proposition 2. Let $I_f$ be a fuzzy implication with respect to a fuzzy negation $N$; then its
natural negation is $N$, i.e., $N_{I_f} = N$.

Proof. Indeed,


$N_{I_f}(x) = I_f(x, 0) = f^{-1}(f(N(x)) \cdot f(0)) = f^{-1}(f(N(x)) \cdot 1) = N(x)$.


3.3 Laws of Contraposition and $I_f$


Proposition 3. If $N_s$ is the strong negation of $I_f$ then $I_f$ satisfies the law of
contraposition with respect to $N_s$.


Proof


$I_f(N_s(y), N_s(x)) = f^{-1}(f(N_s(N_s(y))) \cdot f(N_s(x))) = f^{-1}(f(y) \cdot f(N_s(x)))$
$= f^{-1}(f(N_s(x)) \cdot f(y)) = I_f(x, y)$.


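Proposition 4. If $N_s$ is the strong negation of $I_f$ then $I_f$ satisfies the law of left
contraposition with respect to $N_s$.


Proof


$I_f(N_s(x), y) = f^{-1}(f(N_s(N_s(x))) \cdot f(y)) = f^{-1}(f(x) \cdot f(y))$
$= f^{-1}(f(N_s(N_s(y))) \cdot f(x)) = I_f(N_s(y), x)$.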


Proposition 5. If $N_s$ is the strong negation of $I_f$ then $I_f$ satisfies the law of right
contraposition with respect to $N_s$.


Proof


$I_f(x, N_s(y)) = f^{-1}(f(N_s(x)) \cdot f(N_s(y))) = f^{-1}(f(N_s(y)) \cdot f(N_s(x))) = I_f(y, N_s(x))$.


3.4 The Left Neutrality Property 


Proposition 6. Let $I_f$ be a fuzzy implication with respect to negation $N$; then the fuzzy
implication $I_f$ satisfies the left neutrality property.


Proof


$I_f(1, y) = f^{-1}(f(N(1)) \cdot f(y)) = f^{-1}(f(0) \cdot f(y)) = f^{-1}(f(y)) = y$.


3.5 The Exchange Principle


Proposition 7. Let $I_f$ be a fuzzy implication with respect to negation $N$; then the fuzzy
implication $I_f$ satisfies the exchange principle.


Proof

$I_f(x, I_f(y, z)) = f^{-1}(f(N(x)) \cdot f(I_f(y, z)))$
$= f^{-1}(f(N(x)) \cdot f(f^{-1}(f(N(y)) \cdot f(z))))$
$= f^{-1}(f(N(x)) \cdot f(N(y)) \cdot f(z))$.

Similarly,

$I_f(y, I_f(x, z)) = f^{-1}(f(N(y)) \cdot f(I_f(x, z))) = f^{-1}(f(N(y)) \cdot f(N(x)) \cdot f(z))$.

Therefore

$I_f(x, I_f(y, z)) = I_f(y, I_f(x, z))$.


3.6 The Identity Principle 


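Proposition 8. Let $I_f$ be a fuzzy implication with respect to negation $N$; then the fuzzy
implication $I_f$ satisfies the identity principle if and only if $x = 0$ or $x = 1$.


Proof

$I_f(x, x) = 1 \Leftrightarrow f^{-1}(f(N(x)) \cdot f(x)) = 1$
$\Leftrightarrow f(N(x)) \cdot f(x) = f(1)$
$\Leftrightarrow f(N(x)) \cdot f(x) = 0$
$\Leftrightarrow f(N(x)) = 0$ or $f(x) = 0$
$\Leftrightarrow N(x) = 1$ or $x = 1$
$\Leftrightarrow x = 0$ or $x = 1$.


3.7 The Ordering Property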


Proposition 9. Let $I_f$ be a fuzzy implication with respect to negation $N$; then the fuzzy
implication $I_f$ satisfies the ordering property if and only if $x = 0$ or $y = 1$.

Proof

$I_f(x, y) = 1 \Leftrightarrow f^{-1}(f(N(x)) \cdot f(y)) = 1$
$\Leftrightarrow f(N(x)) \cdot f(y) = f(1)$
$\Leftrightarrow f(N(x)) \cdot f(y) = 0$
$\Leftrightarrow f(N(x)) = 0$ or $f(y) = 0$
$\Leftrightarrow x = 0$ or $y = 1$,

i.e., $x = 0 \le y$, $\forall y \in [0,1]$, or $x \le 1 = y$, $\forall x \in [0,1]$.


3.8 Example

Let $f(x) = 1 - x$, which is strictly decreasing, and let the fuzzy negation be

$N(x) = 1$ if $x = 0$; $N(x) = 0$ if $x \in (0, 1]$.

Then the implication of Definition 7 yields

$I_f(x, y) = f^{-1}(f(N(x)) \cdot f(y))$.

If $x = 0$ then $N(0) = 1$, therefore

$I_f(x, y) = f^{-1}(f(1) \cdot f(y)) = f^{-1}(0 \cdot f(y)) = f^{-1}(0) = 1$.

If $x \in (0, 1]$ then $N(x) = 0$, therefore

$I_f(x, y) = f^{-1}(f(0) \cdot f(y)) = f^{-1}(1 \cdot f(y)) = f^{-1}(f(y)) = y$.

That is, we produce the implication

$I_f(x, y) = 1$ if $x = 0$; $I_f(x, y) = y$ if $x \in (0, 1]$.

4 Conclusion 


The above procedure has enabled us to generate a new class of fuzzy implications. Its
importance lies in the fact that the reasoning process is improved, since for a given
application we can choose the most appropriate implication
from a wider class. This methodology for choosing the most suitable fuzzy implication
from a given class of implications will be applied in a forthcoming study based on
statistical data from previous research ([10-12]).
Abstract. Facial expression recognition has significant application value in
fields such as human-computer interaction. Recently, Convolutional Neural
Networks (CNNs) have been widely utilized for feature extraction and
expression recognition. Network ensemble is an important step to improve recognition
performance. To address the inefficiency of existing ensemble strategies, we
propose a new ensemble method to efficiently find networks with
complementary capabilities. The proposed method is verified on two groups of CNNs with
different depths (eight 5-layer shallow CNNs and twelve 11-layer deep VGGNet
variants) trained on FER-2013 and RAF-DB, respectively. Experimental results
demonstrate that the proposed method achieves recognition accuracies
of 74.14% and 85.46% on the FER-2013 and RAF-DB databases, respectively, which, to the
best of our knowledge, outperform state-of-the-art CNN-based facial
expression recognition methods. In addition, our method also obtains a competitive
result for the mean diagonal value of the confusion matrix on the RAF-DB test set.


Keywords: Convolutional Neural Networks · Ensemble learning · Expression recognition


1 Introduction 


Facial Expression Recognition (FER) analyzes the category (e.g., happiness, sadness) 
of human expression based on face recognition. FER has been widely studied as 
accurate recognition of human facial expression is a fundamental step for many 
computer vision applications, such as medical security and human-computer interac- 
tion. Significant progress has been made in the last decade [1-5]. However, FER is a 
difficult task due to various illumination conditions, head position and occlusion in 
different face images. If feature extraction is carried out directly using these raw data, it 
would increase feature extraction error and eventually reduce FER performance. As a 
result, before feature extraction, preprocessing of facial images is necessary, such as 
face recognition, facial landmarks detection, face registration, histogram equalization, 
etc. 

Despite continuous research efforts, FER in uncontrolled environments is still
a challenging problem [5]. So far, most top-performing approaches tend to utilize
shallow neural networks with ensemble learning methods [5-7]. An ensemble of networks
not only makes use of the strong feature learning ability of neural networks, but also
exploits the ability of different networks to complement each other during ensemble
learning. As a result, an ensemble of multiple networks usually has better FER
performance than methods based on a single classifier. However, these methods have three main
limitations: (1) shallow networks need more training overhead than deep networks to
reach the same training termination condition; (2) because of their weak fitting ability,
shallow networks are often inferior to deep networks in terms of performance; (3) most
ensemble learning methods utilize all trained networks to make final decisions, but
according to our experiments, an ensemble of all networks does not necessarily achieve
optimal performance. To solve these problems, in this paper we propose a new
ensemble learning method which combines complementary CNNs to achieve high
performance with less time consumption. The framework of the method is shown in Fig. 1.
The main steps of this method are summarized as follows:


— Twenty CNNs (including twelve deep CNNs and eight shallow CNNs) are trained 
as the candidate network set. 

— An optimal deep network is selected to form our baseline system according to 
recognition performance. 

— Candidate networks are added to or removed from our system until the best per- 
formance is achieved. 
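
The three steps above can be sketched as a greedy selection loop. This is only an illustration under assumptions that this section does not confirm: networks are combined by summing their class probabilities, performance is the accuracy on a held-out validation set, and a change is accepted only if it strictly improves that accuracy. All names and the toy data are hypothetical.

```python
def ensemble_accuracy(member_probs, labels):
    """Accuracy of the ensemble that sums the members' class probabilities."""
    correct = 0
    for i, label in enumerate(labels):
        num_classes = len(member_probs[0][i])
        summed = [sum(p[i][c] for p in member_probs) for c in range(num_classes)]
        predicted = max(range(num_classes), key=summed.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)


def select_ensemble(candidates, labels):
    """Start from the best single network, then add or remove candidates
    while the validation accuracy strictly improves."""
    score = lambda names: ensemble_accuracy([candidates[n] for n in names], labels)
    selected = [max(candidates, key=lambda n: score([n]))]  # baseline system
    best = score(selected)
    improved = True
    while improved:
        improved = False
        for name in candidates:  # try adding each unused network
            if name not in selected and score(selected + [name]) > best:
                selected.append(name)
                best = score(selected)
                improved = True
        for name in list(selected):  # try removing each selected network
            trial = [n for n in selected if n != name]
            if trial and score(trial) > best:
                selected, best = trial, score(trial)
                improved = True
    return selected, best


# Toy validation data: 4 samples, 2 classes, 3 hypothetical networks
labels = [0, 1, 0, 1]
candidates = {
    "A": [[0.9, 0.1], [0.2, 0.8], [0.8, 0.2], [0.6, 0.4]],
    "B": [[0.55, 0.45], [0.45, 0.55], [0.45, 0.55], [0.1, 0.9]],
    "C": [[0.4, 0.6], [0.6, 0.4], [0.4, 0.6], [0.6, 0.4]],
}
selected, accuracy = select_ensemble(candidates, labels)
```

In the toy data, networks A and B each misclassify a different sample, so their combination is better than either alone, while network C adds nothing and is left out.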


Fig. 1. Overview of our ensemble method. [Diagram: shallow network group and deep network
group; training candidate networks; network evaluation; baseline system; add or remove
networks; ensemble stage.]


The proposed method is evaluated on two real-world facial expression databases (FER- 
2013 [8] and RAF-DB [9]). To the best of our knowledge, our method outperforms 
state-of-the-art top performing works on FER-2013 and RAF-DB databases. 
2 Related Works


2.1 Facial Landmarks Detection and Expression Recognition 


Face images obtained under non-restrictive settings tend to have different degrees of 
occlusions and varied postures. To extract accurate facial features from these images, 
facial landmarks detection is usually required. Xiong and Torre proposed a Supervised 
Descent Method for minimizing a Non-linear Least Squares function. They also pro- 
posed a well-defined alignment error function which can be minimized using existing 
algorithms [10]. Sun et al. proposed an effective three-layer CNNs cascaded for facial 
landmarks detection [11]. Ren et al. learned a set of highly discriminative local binary 
features for each facial landmark independently [12]. These features are then used to 
jointly learn a linear regression to quickly locate facial landmarks. Zhu et al. proposed a 
3D Dense Face Alignment and used cascaded CNNs to handle face alignment in the 
case of large pose variations and self-occlusions [13]. 

Kim et al. utilized alignable faces and non-alignable faces to improve FER per- 
formance [7]. They designed an alignment-mapping network to learn how to generate 
aligned faces from non-aligned faces. Rudovic et al. proposed a probabilistic method to 
implement facial expression recognition using head pose invariant [14]. The method 
performed head pose estimation, head pose normalization and facial expression 
recognition based on 39 facial landmarks. 


2.2 Neural Networks 


Krizhevsky et al. [15] proposed an eight-layer CNN in 2012 and made breakthrough 
progress in image classification. Because of their powerful feature representation 
ability, neural networks have been successfully applied to many computer vision 
applications, such as speech recognition [16] and semantic segmentation [17]. 
Recently, several FER methods utilized deep neural networks for improving perfor- 
mance. Liu et al. proposed a Boosted Deep Belief Network framework to carry out 
feature learning, feature selection and classifier construction iteratively [18]. Molla- 
hosseini et al. proposed a deep neural architecture which applied the Inception layer 
[19] to address FER problem across multiple standard face databases [20]. In [21], 
Tang showed that significant gains can be obtained on several deep learning databases 
by simply replacing softmax with L2-SVMs. Meng et al. proposed an identity-aware 
convolutional neural network to alleviate high inter-subject variations [22]. They 
introduced an expression-sensitive contrastive loss and an identity-sensitive contrastive 
loss to show that learning features are not influenced by the variations of facial 
expression and different subjects. Vo et al. proposed CNN-based method to detect 
global and local facial expression features [23]. In their work, global features were 
computed to obtain possible candidate classification results for a face, and then, local 
features were utilized to reorder the previously obtained candidates to yield final 
recognition results. 
2.3 Ensemble Learning


Ensemble learning builds a hypothesis set by training a series of learners [24]. Ensembles
of multiple neural networks have long been studied in different
visual fields [25-28]. As different neural networks provide complementary
decision-making information, in theory, the more diverse the trained networks are, the better
the ensemble performs. Data preprocessing and different training configuration
schemes can lead to network diversity (e.g., using different training sets, or choosing whether to
adopt the dropout strategy [29]). In recent FER studies, the combination of deep learning
and ensemble learning has made remarkable progress. Yu and Zhang trained six 8-layer
CNNs and automatically learned the ensemble weights among these networks by
optimizing two loss functions [5]. Kim et al. constructed a hierarchical committee
architecture with exponentially weighted decision fusion [6]. They combined nine
5-layer shallow CNNs with three 3-layer MLP classifiers (trained using features extracted
from three alignment-mapping networks) in the test stage [7]. Images in the training and
test sets were divided into alignable faces and non-alignable faces. The results on the
FER-2013 database showed that the combination of alignable faces and non-alignable faces can
improve FER performance.


3 Proposed Approach 


3.1 Problem Analysis 


Depth of Networks. In general, adding network layers significantly increases
the number of network parameters and, as a result, the training overhead in time and space.
For this reason, many works have limited training to shallow networks [5-7].
However, we observe counter-examples in our experiments. For example, when using
“Xavier” [30] for parameter initialization and “ReLU” for activation to train on the FER-2013
database, shallow networks take more training time than deep networks. Nevertheless,
these shallow networks do not achieve the expected performance improvement. This shows
that, considering time overhead and recognition accuracy, we should primarily train
deep networks.

As the literature has pointed out, the diversity of networks affects ensemble
performance. However, to the best of our knowledge, most works have not explored
diversity in network depth. We believe that ensemble learning performance can be
improved if shallow networks are also trained in order to exploit this diversity.


Ensemble Strategy. Ensemble of all networks does not necessarily achieve optimal 
performance, which is mainly based on the following consideration: some networks do 
not provide complementary capabilities to other networks. In this case, addition of 
more networks might introduce negative effects for samples which had been predicted 
correctly. 
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3.2 Configurations of All Networks 


Because different data preprocessing, parameter initialization, activation functions,
training settings, and layer settings lead to a variety of network models, we train
eight 5-layer networks (shallow CNNs) and twelve 11-layer VGGNet [31] variants (deep
CNNs) for the ensemble stage.

The forward propagation process of the 5-layer shallow CNNs is shown in Fig. 2. The
architecture can be simplified as CPCPCPFDF (C, P, F, and D stand for Convolution,
Pooling, Fully connected layer, and Dropout, respectively). The detailed configurations
of the 5-layer networks are summarized in Table 1. All of these networks use ReLU [32]
as the activation function.


[Figure 2 diagram. Stage labels: Input images, C+P, Flattened, D. Sizes shown: 42x42,
32x21x21, 32x11x11, 64x6x6, 1024, 7.]


Fig. 2. The forward propagation process of 5-layer shallow CNNs.


Table 1. Configurations of eight 5-layer networks. Raw: Raw training data. Hist: Histogram
equalization. Prep: Preprocessing method. Stand: Standardization. M-M: Maximum-Minimum
normalization. WIni: Weight Initialization. TruN: Truncated Normal distribution. Xav: Xavier
initialization [30]. WRe: Weight Regularization. FCDrop: Dropout strategy used in the Fully
Connected (FC) layers. FC1: The first Fully Connected layer.

Config | Data | Prep  | WIni | WRe    | FCDrop
1      | Raw  | Stand | TruN | 0.0001 | FC1 = 0.5
2      | Raw  | Stand | Xav  | 0.0001 | FC1 = 0.5
3      | Raw  | M-M   | TruN | 0.0001 | FC1 = 0.5
4      | Raw  | M-M   | Xav  | 0.0001 | FC1 = 0.5
5      | Hist | Stand | TruN | 0.0001 | FC1 = 0.5
6      | Hist | Stand | Xav  | 0.0001 | FC1 = 0.5
7      | Hist | M-M   | TruN | 0.0001 | FC1 = 0.5
8      | Hist | M-M   | Xav  | 0.0001 | FC1 = 0.5


The forward propagation process of the 11-layer VGGNet variants is shown in Fig. 3.
Their architecture can be expressed as 4*(CCPD)FDFF (4* indicates that the block is
repeated four times). The detailed configurations of the 11-layer CNNs are summarized
in Table 2. Depending on the configuration, the activation step of an 11-layer CNN
uses BN+ReLU, ReLU+BN, or ReLU alone.


Fig. 3. The forward propagation process of 11-layer VGGNet variants.


Table 2. Configurations of twelve 11-layer VGGNet variants. BN: Batch Normalization [33]. 
Act: Activation function. BN+ReLU: Execute BN first, then ReLU. [ReLU+BN]: Execute ReLU 
first, then BN for all layers except for last FC layer (only ReLU). CCP: Successive Convolution, 
Convolution, and Pooling. CCPDrop: Dropout strategy used after every CCP. 


Config | Data | Prep  | WIni | Act       | CCPDrop
9      | Raw  | Stand | Xav  | BN+ReLU   | 0.2
10     | Raw  | Stand | Xav  | [ReLU+BN] | 0.2
11     | Raw  | Stand | Xav  | ReLU      | 0.2
12     | Raw  | M-M   | Xav  | BN+ReLU   | 0.2
13     | Raw  | M-M   | Xav  | [ReLU+BN] | 0.2
14     | Raw  | M-M   | Xav  | ReLU      | 0.2
15     | Hist | Stand | Xav  | BN+ReLU   | 0.2
16     | Hist | Stand | Xav  | [ReLU+BN] | 0.2
17     | Hist | Stand | Xav  | ReLU      | 0.2
18     | Hist | M-M   | Xav  | BN+ReLU   | 0.2
19     | Hist | M-M   | Xav  | [ReLU+BN] | 0.2
20     | Hist | M-M   | Xav  | ReLU      | 0.2
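
As a quick consistency check on the two architecture strings (CPCPCPFDF and
4*(CCPD)FDFF) and on the feature-map sizes quoted in Fig. 2, the sketch below counts
the weight layers (C and F) and halves the spatial size at each pooling stage. The
helper names are hypothetical, and the rounding rule (size-preserving convolutions
followed by stride-2 pooling that rounds up) is an assumption chosen because it
reproduces 42 -> 21 -> 11 -> 6; the padding scheme is not stated in the text.

```python
import math

def count_weight_layers(arch: str) -> int:
    """Count convolution (C) and fully connected (F) layers in an architecture string."""
    return arch.count("C") + arch.count("F")

def pooled_sizes(size: int, stages: int) -> list:
    """Spatial size after each stride-2 pooling stage, rounding up (assumed rule)."""
    sizes = []
    for _ in range(stages):
        size = math.ceil(size / 2)
        sizes.append(size)
    return sizes

shallow = "CPCPCPFDF"
deep = "CCPD" * 4 + "FDFF"            # 4*(CCPD)FDFF written out

print(count_weight_layers(shallow))   # 5
print(count_weight_layers(deep))      # 11
print(pooled_sizes(42, 3))            # [21, 11, 6]
```

Under the same assumption, the flattened vector fed to the first fully connected
layer of a shallow network has 64 x 6 x 6 = 2304 entries.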


In the training stage, we use exponential decay to update the learning rate. The
learning rate of a network is updated as:


$$\eta = \eta_0 \times 0.99^{N} \tag{1}$$


where $\eta_0$ denotes the initial learning rate, $\eta$ represents the new learning rate,
and N is the number of epochs. In order to fully learn the features of the training
samples, a more severe
termination condition must be satisfied to stop the training, that is, the training error of 
a batch does not exceed 10°° for three consecutive times or the training reaches 
maximum number of iterations. 
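
The schedule in Eq. (1) can be sketched in a few lines. The function name and the
`decay` parameter are hypothetical; only the factor 0.99 and the initial rates come
from the text.

```python
def decayed_learning_rate(eta0: float, epoch: int, decay: float = 0.99) -> float:
    """Exponential decay: eta = eta0 * decay ** epoch, with epoch counted from 0."""
    return eta0 * decay ** epoch

# Print the rate at a few epochs for an initial rate of 0.05.
for n in (0, 1, 10, 100):
    print(n, decayed_learning_rate(0.05, n))
```

With this factor the rate falls to roughly 37% of its initial value after 100 epochs,
so the decay is gentle compared with step schedules.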


3.3 Ensemble Method 


According to the above analysis, combining complementary CNNs improves system
performance. This can be achieved by gradually adding networks that improve
recognition accuracy and removing networks that degrade system performance.

In general, the fitting ability of deep networks is better than that of shallow
networks. Therefore, at the beginning, we select the network with the best accuracy
from the deep network group as our baseline system.

In the next step, candidate networks from all shallow and deep networks are added
to or removed from the baseline system step by step until the best performance is
achieved. The ensemble mode in this paper is majority vote. Note that the evaluation
scores of all networks on the validation and test data have been calculated in
advance, so the ensemble selection process does not take long: all we need to do is
matrix addition.
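
The selection loop can be sketched as follows. This is a simplified illustration,
not the authors' exact procedure: it assumes that each network's precomputed scores
form an (images x classes) matrix, that the ensemble decision is the arg-max of the
summed matrices, and that the search stops as soon as no single addition or removal
raises validation accuracy. All function and variable names are hypothetical.

```python
import numpy as np

def ensemble_accuracy(scores, members, labels):
    """Accuracy of the ensemble obtained by adding the members' score matrices."""
    total = np.sum([scores[m] for m in members], axis=0)   # matrix addition
    return float(np.mean(np.argmax(total, axis=1) == labels))

def greedy_select(scores, labels, deep_ids):
    """Start from the best deep network, then add/remove networks greedily."""
    members = [max(deep_ids, key=lambda i: ensemble_accuracy(scores, [i], labels))]
    best_acc = ensemble_accuracy(scores, members, labels)
    while True:
        # Candidate moves: add any network (repeats allowed) or drop one member.
        moves = [members + [c] for c in range(len(scores))]
        if len(members) > 1:
            moves += [members[:k] + members[k + 1:] for k in range(len(members))]
        acc, move = max(((ensemble_accuracy(scores, m, labels), m) for m in moves),
                        key=lambda t: t[0])
        if acc <= best_acc:            # no move improves validation accuracy
            return members, best_acc
        members, best_acc = move, acc

# Toy data: three networks, four validation images, two classes.
labels = np.array([0, 1, 0, 1])
scores = [
    np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.6, 0.4]]),  # "deep" net 0
    np.array([[0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]]),  # "deep" net 1
    np.array([[0.3, 0.7], [0.3, 0.7], [0.8, 0.2], [0.2, 0.8]]),  # "shallow" net 2
]
print(greedy_select(scores, labels, deep_ids=[0, 1]))
```

Because every candidate ensemble is evaluated by summing stored matrices, no network
has to be re-run during the search.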


4 Experiments on the FER-2013 Database 


4.1 FER-2013 


FER-2013 [8] is one of the largest facial expression databases so far. It has 28,709 
images for training, 3,589 images for public test, and 3,589 images for private test. To 
reduce training errors, we remove 46 non-face images and 11 non-number filled images 
from original database. 

We use IntraFace [10, 34] to detect facial landmarks. We label an image as Non-
Alignable Faces (NAF) if its detection score is smaller than a given threshold;
otherwise, we label it as Before Registered Alignable Faces (BRAF). An affine
transformation is then applied to bring the two eyes into a horizontal position. We
refer to the After Registered Alignable Faces as ARAF.

Data augmentation is applied to the training, validation, and test sets following the
method introduced in [7]. Specifically, a 10-fold augmentation is used in this work
(four 42 x 42 corner crops and a resized copy of the original image, as well as their
horizontal flips).
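
A minimal sketch of this 10-view augmentation is given below. It assumes a 48 x 48
grayscale input (the native FER-2013 resolution) stored as a NumPy array, and it
uses nearest-neighbour sampling as a stand-in for whatever resize the original
pipeline used; the function name is hypothetical.

```python
import numpy as np

def ten_views(img: np.ndarray, size: int = 42) -> np.ndarray:
    """Return 10 views: 4 corner crops, 1 resized copy, and their horizontal flips."""
    h, w = img.shape
    corners = [
        img[:size, :size],            # top-left
        img[:size, w - size:],        # top-right
        img[h - size:, :size],        # bottom-left
        img[h - size:, w - size:],    # bottom-right
    ]
    # Nearest-neighbour resize of the whole image to size x size (assumed method).
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = img[np.ix_(rows, cols)]
    views = corners + [resized]
    views += [v[:, ::-1] for v in views]   # horizontal flips
    return np.stack(views)

image = np.arange(48 * 48, dtype=np.float32).reshape(48, 48)
print(ten_views(image).shape)   # (10, 42, 42)
```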


4.2 Training and Evaluation 


In the training stage, the initial learning rate of both the shallow and the deep
network groups is set to 0.05. The maximum number of iterations for the shallow and
deep network groups is 600000 and 200000, respectively.

During the validation and testing stages, the score of each image is the mean over
its 10 corresponding augmented images. For Alignable Faces (AF), we evaluate Before
Registered Alignable Faces (BRAF) and After Registered Alignable Faces (ARAF)
separately and average the two values. After evaluating Non-Alignable Faces (NAF),
the results of all validation (testing) samples are combined using the following
formula:


$$acc = acc(AF) \times \alpha + acc(NAF) \times \beta \tag{2}$$


where $\alpha$ is the proportion of alignable faces in the validation (testing) set, and
$\beta$ is the proportion of non-alignable faces in the validation (testing) set.
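
The two evaluation steps above (averaging over augmented views, then weighting the
AF and NAF accuracies by their proportions) can be sketched as follows. The function
names and the example numbers are hypothetical and serve only to show the arithmetic.

```python
import numpy as np

def image_score(view_scores):
    """Mean class-score vector over the augmented views of one image."""
    return np.mean(np.asarray(view_scores, dtype=float), axis=0)

def combined_accuracy(acc_af, acc_naf, n_af, n_naf):
    """Weight the two accuracies by the proportions of alignable / non-alignable faces."""
    alpha = n_af / (n_af + n_naf)     # proportion of alignable faces
    beta = n_naf / (n_af + n_naf)     # proportion of non-alignable faces
    return acc_af * alpha + acc_naf * beta

# Made-up numbers: 75 alignable and 25 non-alignable faces.
print(combined_accuracy(0.8, 0.6, 75, 25))
```

Since the two proportions sum to one, the combined value equals the overall accuracy
over all samples.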


4.3 Ensemble and Analysis


For FER-2013, we conduct network ensemble experiments on the validation set. After
determining the optimal network combination, the test set is used for the final
performance evaluation.


Baseline System. In deep network group, network No. 11 is selected as the baseline 
system as it has the highest validation accuracy (70.52%). 


Ensemble Process. In this experiment, all deep and shallow networks are used as
candidates to explore how system performance changes. In the first step, network
No. 18 is selected because the combination of No. 18 and the baseline system has the
top performance (71.79%). In the second step, network No. 13 is selected because the
combination of No. 13 and the new system has the best performance (72.35%). This
process continues until system performance no longer improves. In the seventh step,
after removing network No. 3, the highest performance (72.64%) is obtained. The
ensemble process is summarized in Table 3. We observe a performance reduction when
more networks are added. For example, the ensemble of all 20 networks yields 72.30%
validation accuracy. Finally, the system achieves 74.14% test accuracy with an
ensemble of five deep CNNs.


Table 3. Ensemble process on FER-2013. 


Step | System          | Acc   | Select | Candidate
1    | 11              | 70.52 | 18     | 1-20
2    | 11 18           | 71.79 | 13     | 1-20
3    | 11 18 13        | 72.35 | 9      | 1-20
4    | 11 18 13 9      | 72.42 | 12     | 1-20
5    | 11 18 13 9 12   | 72.64 | 3      | 1-20
6    | 11 18 13 9 12 3 | 72.62 | -      | 1-20
7    | 11 18 13 9 12   | 72.64 | 3      | 1-20


To prove the feasibility of our method, we list ensemble accuracy of the shallow 
network group, the deep network group, and all networks on the validation set in 
Table 4. 


Table 4. Performance comparison of different combinations on FER-2013. 


Networks              | Accuracy
Shallow network group | 70.52
Deep network group    | 72.16
All 20 networks       | 72.30


Result Analysis. Table 5 compares our method with state-of-the-art works on the
FER-2013 database. The proposed method combines only five deep CNNs (11-layer) to
achieve 74.14% test accuracy. To the best of our knowledge, this method outperforms
other state-of-the-art CNN-based FER methods. Moreover, the proposed method is more
efficient than the other methods. It is implemented on a personal computer with an
i7-7700k CPU, 16 GB of memory, and a GTX 1080Ti GPU. The average time to process a
test image using a shallow network and a deep network is 12.3 ms and 14.1 ms,
respectively. The ensemble of five deep networks takes 72.7 ms. In contrast, [5] and
[7] take 76.3 ms and 146.6 ms, respectively, to process the same image on our
personal computer.


Table 5. Performance comparison of the proposed method and state-of-the-art works on FER- 
2013. 


Ref. | Methods                                                | Accuracy | Average time (ms)
[21] | A DCN using L2-SVM loss                                | 71.16%   | -
     | A DCN using cross-entropy loss                         | 70.1%    | -
[5]  | Ensemble of six 8-layer CNNs using learned weights     | 72%      | 76.3
[6]  | Ensemble of 36 DCNs in a hierarchical committee        | 72.72%   | -
[7]  | Ensemble of three MLP classifiers and nine 5-layer CNNs | 73.73%   | 146.6
Ours | Five 11-layer CNNs                                     | 74.14%   | 72.7


5 Experiments on the RAF-DB Database 


5.1 RAF-DB 


RAF-DB [9] is also a real-world facial expression database, annotated through
crowdsourcing. The database contains about 30000 images covering 7 basic single-class
expressions and 11 compound expressions. In our experiment, we use only the 15339
registered images of single-class expressions, comprising 12271 training images and
3068 test images. We triple the training and test sets by using each original image
together with its horizontal mirror and its vertical mirror.


5.2 Training and Evaluation 


In training stage, the initial learning rate of shallow and deep network group is set to 
0.01 and 0.05, respectively. The maximum number of iterations for shallow and deep 
network group is 200000 and 20000, respectively. 

During the testing stage, the score of each image is the mean over its three
corresponding augmented images.


5.3 Ensemble and Analysis 


Baseline System. As for FER-2013, the best-performing network from the deep network
group, No. 19 (83.41%), is selected as the baseline system.


Ensemble Process. Candidate networks are again selected from both the deep and the
shallow network groups. In the third step, the system performance increases to 85.46%
when three networks are combined (see Table 6 for details). From then on, adding more
networks reduces system performance. For instance, adding network No. 19 again
reduces system performance from 85.46% to 85.23%, and the highest ensemble
performance is recovered after removing network No. 19 from the system network set.
Finally, the system achieves 85.46% accuracy with an ensemble of two deep CNNs and
one shallow CNN.


Table 6. Ensemble process on RAF-DB. 


Step | System     | Acc   | Select | Candidate
1    | 19         | 83.41 | 13     | 1-20
2    | 19 13      | 84.88 | 3      | 1-20
3    | 19 13 3    | 85.46 | 19     | 1-20
4    | 19 13 3 19 | 85.23 | -      | 1-20
5    | 19 13 3    | 85.46 | 19     | 1-20


In Table 7, we list the ensemble performance of shallow network group, deep 
network group, and all 20 networks on the RAF-DB database. 


Table 7. Performance comparison of different combinations on RAF-DB. 


Networks              | Accuracy
Shallow network group | 83.54
Deep network group    | 84.39
All 20 networks       | 84.42


Table 8. Our method is compared with the existing methods on two evaluation criteria: 
diagonal average of confusion matrix (Ave) and recognition accuracy (Acc). The results of center 
loss [35] + LDA, center loss + mSVM, DLP-CNN [9] + LDA and DLP-CNN + mSVM are 
tested in [9]. Seven numbers in the second line represent the number of samples of different 
expressions on original training set. Sur: Surprise, Fea: Fear, Dis: Disgust, Hap: Happy, Ang: 
Anger, Neu: Neutral. 


Methods            | Sur   | Fea   | Dis   | Hap   | Sad   | Ang   | Neu   | Ave   | Acc
Training samples   | 1290  | 281   | 717   | 4772  | 1982  | 705   | 2524  |       |
Our                | 80.24 | 47.30 | 45    | 94.68 | 82.22 | 74.07 | 90.59 | 73.44 | 85.46
center loss + LDA  | 76.29 | 54.05 | 49.38 | 92.41 | 74.90 | 64.81 | 77.21 | 69.86 | 79.96
center loss + mSVM | 79.63 | 54.05 | 53.13 | 93.08 | 78.45 | 68.52 | 83.24 | 72.87 | 82.86
DLP-CNN + LDA      | 74.07 | 52.50 | 55.41 | 90.21 | 73.64 | 77.51 | 73.53 | 70.98 | 78.81
DLP-CNN + mSVM     | 81.16 | 62.16 | 52.15 | 92.83 | 80.13 | 71.60 | 80.29 | 74.20 | 82.84


Result Analysis. The proposed ensemble method not only achieves the best recognition
accuracy but also has competitive results for the average accuracy of the seven
single-class expressions (the mean diagonal value of the confusion matrix). The
diagonal values of the confusion matrix after the ensemble of three networks are
shown in Table 8, together with four existing methods for comparison. All of these
methods apply different loss functions to train neural networks and then use the
extracted feature vectors to train LDA and SVM classifiers. In contrast, our method
uses only the softmax loss for training and uses the neural networks directly for
classification, with competitive performance.
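
The two criteria in Table 8 differ only in how classes are weighted, which the
following sketch makes explicit. It assumes a confusion matrix whose rows are true
classes and whose columns are predictions; the function names are hypothetical. The
last lines recompute the "Ave" entry of our method from the seven per-class values
listed in Table 8.

```python
import numpy as np

def diagonal_average(conf):
    """Mean of per-class recalls: every class counts equally (Ave)."""
    conf = np.asarray(conf, dtype=float)
    per_class = np.diag(conf) / conf.sum(axis=1)
    return float(per_class.mean())

def overall_accuracy(conf):
    """Fraction of correctly classified samples: every sample counts equally (Acc)."""
    conf = np.asarray(conf, dtype=float)
    return float(np.trace(conf) / conf.sum())

# Per-class values of our method from Table 8 (Sur, Fea, Dis, Hap, Sad, Ang, Neu).
ours = [80.24, 47.30, 45.0, 94.68, 82.22, 74.07, 90.59]
print(round(float(np.mean(ours)), 2))   # 73.44
```

Because classes such as Fear and Disgust have few samples, they pull the diagonal
average down much more than they affect the overall accuracy.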


6 Conclusion 


In this paper, we propose a new ensemble-learning-based method for improving facial
expression recognition. Specifically, two groups of CNNs (eight 5-layer CNNs and
twelve 11-layer CNNs) are trained with various configurations. On this basis, a new
network ensemble method is proposed that combines complementary CNNs to improve FER
performance. Extensive experiments on FER-2013 and RAF-DB show that the proposed
method achieves excellent recognition accuracy with less time overhead. A comparison
with state-of-the-art works demonstrates that our method reaches the best recognition
accuracy (74.14%) on the FER-2013 database. On the RAF-DB database, our ensemble
method also achieves the highest recognition accuracy (85.46%) and a competitive
diagonal mean value of the confusion matrix (73.44%) without a complicated training
process.


Abstract. Benefiting directly from the powerful generative adversarial
networks (GANs) of recent years, various new image processing tasks
related to image generation and synthesis have gained popularity. One
such application is individual portrait photo beautification based on
facial expression detection and editing. Yet, automatically beautifying
group photos without tedious and fragile human intervention remains
challenging. The difficulties arise from diverse facial expression
evaluation, harmonious expression generation, and context-sensitive
synthesis from single/multiple photos. To address them, we devise a
two-stage deep network for automatic group-photo evaluation and
beautification by seamlessly integrating a multi-label CNN with Bayesian
network enhanced GANs. First, our multi-label CNN is designed to evaluate
the quality of facial expressions. Second, our novel Bayesian GANs
framework is proposed to automatically generate photo-realistic beautiful
expressions. Third, to further enhance the naturalness of beautified group
photos, we embed Poisson fusion in the final layer of the GANs in order
to synthesize all the beautified individual expressions. We conducted
extensive experiments on various kinds of single-/multi-frame group
photos to validate our novel network design. All the experiments confirm
that our method can uniformly accommodate diverse expression evaluation
and generation/synthesis of group photos, and that it outperforms the
state-of-the-art methods in terms of effectiveness, versatility, and
robustness.


Keywords: Beautification of group-photo facial expressions · Multi-label CNN ·
Bayesian networks · Generative adversarial networks · Poisson fusion


1 Introduction and Motivation 


With the omnipresence of digital cameras in today’s society, group photos are
routinely captured to record wonderful moments shared by families, friends,
colleagues, etc. Hence, expectations regarding the overall quality of group
photos are rising. In practice, it is almost impossible to capture satisfying
facial expressions of all involved people at the same moment with the various
types of hand-held devices. Therefore, smart group photo evaluation and
beautification techniques are urgently needed. However, several challenges
remain to be overcome, including the evaluation of group-photo facial
expressions on an individual basis, the simultaneous generation of satisfying
expressions for all people involved, and the natural integration of individual
facial expressions into the final group photo. Evaluation and beautification of
facial expressions in such unconstrained settings remain an ill-posed task due
to various factors, such as non-frontal faces, varying lighting in different
outdoor/indoor settings, and the large variation in facial identities and
appearances.

With the goal of tackling the aforementioned challenges, a growing body of
research has been devoted to related techniques. For example, recent works have
demonstrated that generative adversarial networks (GANs) are extremely
effective in tasks ranging from image translation [6,8,17,20] to face
generation [2,13,15,16] and even image completion [4,7,12]. Nonetheless, most
existing methods employ the entire feature space to approximate the generative
feature distribution, which cannot adequately preserve the facial expression
details of all individuals involved. In addition, most existing works
concentrate on the attribute manipulation/transformation of a single object and
lack a principled way to optimize group-photo facial expressions.

In this paper, our research efforts are devoted to pioneering a systematic
approach for synthesizing a satisfying group photo by leveraging the combined
power of CNNs and GANs. Specifically, we propose a two-stage deep network for
automatic group-photo evaluation and beautification, which can greatly reduce
the negative influences caused by the diversity of faces. Figure 1 highlights
the framework of our novel method, which consists of three major steps:
(1) Facial expression recognition with a multi-label CNN and our newly proposed
facial expression evaluation metric: the multi-label CNN recognizes two main
beautification-related expressions (mouth-smiling and eyes-opening) and
predicts the softmax value of each expression for further evaluation; (2) Face
beautification with our Bayesian GANs: it is guided by subspace clustering
based on attributes-aware priors, wherein we pre-distribute all the attributes’
weights according to the impact of specific face regions on the entire face
appearance; (3) Integration of multiple single-person faces driven by ensemble
Poisson fusion: we add a Poisson layer to naturally fuse each single-person
face into the original group photo with gradual gradient changes. The salient
contributions of this paper can be summarized as follows:


- We pioneer a two-stage group-photo beautification framework by combining a
  multi-label CNN with Bayesian network enhanced GANs, which can naturally
  and automatically perform evaluation and beautification on group photos in a
  uniform and elegant way.

- We propose novel Bayesian GANs that automatically generate beautiful
  expressions by embedding a Bayesian prior network into the powerful CycleGANs,
  and that have strong generalization ability for weakly-matched training datasets.

- We propose to embed the Poisson image cloning technique in the final layer of
  our Bayesian GANs in order to synthesize all the to-be-beautified expressions
  of all individuals from single-/multi-frame continuous group photos, which
  leads to meaningful and harmonious manipulation in any local region of a
  group photo.


[Figure 1 diagram. Stage labels: Face Detection, Expression Recognition, Expression
Evaluation and Beautification, Image Fusion.]


Fig. 1. The architecture of our framework. A group photo is converted into several
single-person faces by using MTCNN [19], a multi-task cascaded convolutional
network that performs face detection from coarse to fine.


2 Related Works 


Facial Expression Recognition Methods. Facial expression recognition has
been gaining momentum, with a wide range of applications. In particular,
expression recognition methods based on CNNs [1,18] and DBNs [9] have
achieved excellent results on facial datasets. For example, Burkert et al. [1]
proposed a facial emotion recognition architecture based on CNNs. It consists of
two parallel feature extraction blocks (FeatEx), which dramatically improve
the performance on public datasets. Liu et al. [9] proposed a boosted deep belief
network (BDBN) for feature learning, feature selection, and classification in a
loopy framework. However, these methods are somewhat cumbersome because of
the high-dimensional, varying features for each attribute, which makes
recognition inefficient. Therefore, we apply multi-task learning to
simultaneously optimize multiple objective functions.


Facial Expression Generation and Editing Methods. In recent years,
many image generation approaches have been proposed. For example, Isola et
al. proposed the pix2pix approach [5] and achieved impressive results on paired
datasets. However, in many cases, paired data are not readily available.
Therefore, image conversion based on unpaired data is particularly important.
Recently, Zhu et al. proposed the CycleGAN [20] method, which employs two
GANs and an additional cycle consistency loss to improve the quality of the
generated images. Meanwhile, DualGAN and DiscoGAN [6,17] adopted a similar
idea for image-to-image translation based on unpaired data. Many GAN-based
methods have also been proposed specifically for face generation. Perarnau et
al. introduced ICGAN [13], which combines an encoder with a cGAN to manipulate
face images conditioned on arbitrary attributes. Shen et al. introduced a
framework [15] that avoids learning redundant facial information by learning
residual images, which focus only on the attribute-specific area of a face image.
However, these works commonly depend heavily on the training dataset and have
difficulty preserving details on other images. Moreover, these methods are
designed for single pre-processed face images rather than group photos.
Therefore, we aim to address these limitations and to achieve strong
generalization ability for weakly-matched test datasets.


3 Facial Expression Evaluation and Beautification 


3.1 Facial Expression Evaluation Based on Multi-label CNN 


In order to synthesize a group photo with perfect facial expressions, we first
need to select, after face detection, the face images that will be manipulated.
Considering the unbalanced distribution of samples in the training and testing
phases for multi-label classification, we adopt a mixed objective optimization
network [14] to recognize different facial attributes. We perform a joint
optimization over all the face attributes on the CelebA dataset [10]. In
practice, we focus on two main beautification-related attributes, mouth-smiling
and eyes-opening. Based on these two attributes, we further construct a
multi-label CNN to recognize the two expressions at the same time. The
multi-task loss is defined as


$$L(x, y) = \sum_{i} p(i \mid y_i(x)) \, \| f_i(x) - y_i(x) \|^2, \tag{1}$$


where $p(i \mid y_i(x))$ is the assigned probability for attribute $i$, which can make
the training set biased, and $f_i(x)$ and $y_i(x)$ respectively represent the predicted
value and the ground truth for attribute $i$. Meanwhile, we formulate a
beautification evaluation metric for facial expressions, which facilitates
beautifying group photos at lower cost. First, we count the number of individual
faces with better expression in each group photo, so that we can choose the
relatively better group photo to serve as our baseplate image. The metric used
for measuring facial expression is the softmax value
$V_i = e^{z_i} / \sum_{j} e^{z_j}$, which is obtained from the recognition network. As
shown in Fig. 2, the softmax value $V_{mn}$ refers to the n-th person of the m-th
group photo. We can directly substitute the target image with the highest
softmax value (which must be greater than 0.5) for the worse one (whose softmax
value is less than 0.5) in the baseplate image using our improved Poisson
fusion. It should be noted that, if there is no satisfying facial image of a
certain person, we resort to our Bayesian GANs to generate a desirable image.


Fig. 2. Illustration of our facial expression evaluation pipeline.
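
The selection rule described above can be sketched as a small planning function.
It assumes that the softmax values are arranged in a matrix with one row per group
photo and one column per person, treats a value of exactly 0.5 as unsatisfying
(the text leaves this case open), and uses hypothetical names throughout.

```python
import numpy as np

def plan_beautification(V, threshold=0.5):
    """V[m, n] is the softmax value of person n in group photo m."""
    V = np.asarray(V, dtype=float)
    good = V > threshold
    # Baseplate: the photo with the largest number of satisfying faces.
    base = int(np.argmax(good.sum(axis=1)))
    plan = {}
    for n in range(V.shape[1]):
        if good[base, n]:
            continue                       # face is already satisfying
        best = int(np.argmax(V[:, n]))     # best frame for this person
        if V[best, n] > threshold:
            plan[n] = ("copy_from", best)  # fuse the face from another frame
        else:
            plan[n] = ("generate", None)   # no good frame: fall back to the GANs
    return base, plan

# Two photos, four people (made-up values).
V = [[0.9, 0.8, 0.3, 0.2],
     [0.4, 0.9, 0.7, 0.1]]
print(plan_beautification(V))
```

In this toy case photo 0 becomes the baseplate, person 2 is copied from photo 1,
and person 3 has to be generated because no frame exceeds the threshold.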


3.2 Facial Expression Beautification with Bayesian GANs 


Considering the importance of diverse faces with various kinds of attributes,
we propose a three-layer Bayesian network to augment GAN models, as shown in
Fig. 3. The first layer of the Bayesian network relates to the attribute
distribution prior, which is vital for clustering semantically similar images
into one attribute-specific subspace. The second layer relates to the subspaces,
which are clustered according to the attributes’ influences on the targeted face
regions. The third layer relates to the trained GANs, which are guided by the
attributes-aware priors resulting from subspace clustering.


[Figure 3 diagram. Labels: Weighted Attributes, Training Set, Subspaces, Priors,
Clustering, Priors, Attributes-aware Subspaces, GANs Bank with Model 1 (Slightly
open), Model 2 (Middle open), and Model 3 (Largely open).]


Fig. 3. Pipeline of our Bayesian GANs based on facial-attribute priors. 


In the first layer, we pre-distribute the attributes’ weights according to
the impact of specific regions on the entire face appearance. The j-th original
attribute label value of the i-th sample, $z_{ij} \in \{1, -1\}$, is re-distributed to
$z_{ij} \in \{\alpha_{ij}, 0\}$, where $\alpha_{ij}$ denotes the new weight of the positive
attribute value, and ‘0’ denotes the negative attribute value, which has no
effect on face appearance. Based on this re-weighted attribute distribution in
the first layer, we employ the k-means algorithm to perform subspace clustering
on the training images according to the influence of the diverse attributes on
the targeted face regions. Here, we use the mean square errors of the attribute
vectors to cluster all the samples into K subspaces,


$$E = \sum_{i=1}^{K} \sum_{z \in S_i} \| z - u_i \|^2, \tag{2}$$


where $S_i$ denotes the i-th subspace, and $\| z - u_i \|^2$ is the squared Euclidean
distance between sample $z$ and the subspace center $u_i$.
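
The re-weighting and clustering step can be sketched with a plain Lloyd-style
k-means in NumPy. This is an illustration under stated assumptions: the weights are
taken to depend only on the attribute (one weight per column), their values are
made up, and all function names are hypothetical.

```python
import numpy as np

def reweight(labels, alpha):
    """Map attribute labels {1, -1} to {alpha_j, 0} (per-attribute weights assumed)."""
    labels = np.asarray(labels)
    alpha = np.asarray(alpha, dtype=float)
    return np.where(labels == 1, alpha, 0.0)

def kmeans(Z, K, iters=100, seed=0):
    """Cluster the rows of Z into K subspaces; returns assignments, centers and E."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=K, replace=False)].copy()
    for _ in range(iters):
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = dist.argmin(axis=1)
        new_centers = np.array([
            Z[assign == k].mean(axis=0) if np.any(assign == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and objective E = sum of squared distances to the centers.
    dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = dist.argmin(axis=1)
    E = float(dist[np.arange(len(Z)), assign].sum())
    return assign, centers, E

labels = [[1, -1], [1, -1], [-1, 1], [-1, 1]]    # four samples, two attributes
Z = reweight(labels, alpha=[2.0, 1.0])           # hypothetical weights
assign, centers, E = kmeans(Z, K=2)
print(assign, E)
```

Giving a larger weight to an attribute stretches the corresponding axis, so samples
that differ in that attribute are more likely to end up in different subspaces.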

After attribute-aware subspace clustering, we describe in detail the process of
generating image samples from the source domain X to the target domain Y.
Given two datasets, the source domain $X = \{x_i \mid 1 \le i \le n_x\}$ and the target
domain $Y = \{y_i \mid 1 \le i \le n_y\}$, where $n_x$ and $n_y$ respectively denote the
numbers of samples in X and Y, we cluster the sample space into three subspaces
$S_i$, $i = 1, 2, 3$, based on the attributes with important impacts. For the
mapping function $G: X \to Y$, we adopt the loss function


$$L_Y = E_{y \sim p_{data}(y)}[(D_Y(y) - 1)^2] + E_{x \sim p_{data}(x)}[(D_Y(G(x)))^2], \tag{3}$$


where $X, Y \in S_i$. Therefore, our Bayesian GANs have excellent generation
ability and can successfully transform images between two domains according to
the attribute-specific subspaces. Given a test image, we first predict its
40 facial attributes using a multi-label CNN model, and then determine which
subspace the test image belongs to, according to the prior knowledge and the
Bayesian network.
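
For illustration, the adversarial loss and the test-time subspace lookup can be
sketched as below. Two assumptions are made: the loss is read as a least-squares
objective of the form (D_Y(y) - 1)^2 + D_Y(G(x))^2 averaged over a batch, and the
subspace of a test image is taken to be the one whose center is nearest to its
re-weighted attribute vector. The function names are hypothetical, and the
discriminator outputs are passed in as plain arrays.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """Least-squares loss: real outputs are pushed to 1, generated outputs to 0."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return float(np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2))

def nearest_subspace(z, centers):
    """Index of the subspace center closest (in Euclidean distance) to vector z."""
    diffs = np.asarray(centers, dtype=float) - np.asarray(z, dtype=float)
    return int(np.argmin((diffs ** 2).sum(axis=1)))

print(discriminator_loss([0.9, 1.0], [0.1, 0.0]))
print(nearest_subspace([1.8, 0.1], centers=[[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]))
```

Once the subspace index is known, the generator trained on that subspace would be
the one applied to the test image.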

For our generator, we use three convolution layers to extract features from 
input images, six residual blocks to preserve the features of the original image, 
and simultaneously transform feature vectors from source domain to target 
domain. Meanwhile, we use three deconvolution layers to restore low-level fea- 
tures from feature vectors. Residual blocks consist of two convolution layers, 
wherein part of the input data is directly added to the output, so that we can 
reduce the deviation of the corresponding output from the original input. Finally, 
we use four convolution layers for our discriminative network. 
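
The skip connection inside a residual block can be written in one line, as the
sketch below shows. The two convolution layers are passed in as generic callables,
and the ReLU between them is an assumption; the text only states that the input is
added to the output of the two layers.

```python
import numpy as np

def residual_block(x, conv1, conv2):
    """y = x + conv2(relu(conv1(x))): the input is added back to the block output."""
    hidden = np.maximum(conv1(x), 0.0)   # assumed ReLU between the two layers
    return x + conv2(hidden)

# Stand-in "layers" so the sketch runs without a deep learning framework.
x = np.array([1.0, -1.0])
print(residual_block(x, conv1=lambda t: t, conv2=lambda t: 2.0 * t))
```

If the two layers output zero, the block returns its input unchanged, which is why
such blocks help keep the generated image close to the original.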


3.3 Poisson Fusion in Our GANs 


To obtain a natural group photo, we need to conduct global fusion via local image
editing [11]. Therefore, we embed a Poisson fusion layer as the final layer of our
GANs. In this layer, we naturally fuse all the generated facial expressions of
different persons into the selected baseplate group photo. The key to Poisson
fusion is to obtain the transformed pixels by solving the Poisson equation. Here,
we construct the linear system according to the method of Poisson image editing
as $A x = b$. Please refer to [3] for the details of this equation.

If we solved the above Poisson equation with Gaussian elimination, it would
incur a large time and memory cost. Since the fusion region is a rectangle,
some characteristics of the matrix A can be leveraged: A is sparse, positive
definite, and can be partitioned into smaller square matrices. Based on these
characteristics, we adopt the conjugate gradient method to solve the equation.
We do not need to store the matrix A, because the conjugate gradient method
only needs the product of A with a vector p, which can be easily obtained via
block matrix operations. Thus, our method not only can embed larger regions
but also achieves a speedup of more than 5000 times (compared to Gaussian
elimination) when both the height and the width of the region are 100 pixels.
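
A matrix-free conjugate gradient solver of this kind can be sketched as follows.
It assumes the standard 5-point discrete Laplacian on a rectangular region with the
boundary values already folded into the right-hand side b; building b from the
source gradients and the target boundary is outside the scope of the sketch, and
the function names are hypothetical.

```python
import numpy as np

def apply_A(p):
    """Product of the 5-point Laplacian with p, computed by shifts (A is never stored)."""
    out = 4.0 * p
    out[1:, :] -= p[:-1, :]    # upper neighbour
    out[:-1, :] -= p[1:, :]    # lower neighbour
    out[:, 1:] -= p[:, :-1]    # left neighbour
    out[:, :-1] -= p[:, 1:]    # right neighbour
    return out

def conjugate_gradient(b, tol=1e-8, max_iter=10000):
    """Solve A x = b for the rectangular region; b has the shape of the region."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = float(np.sum(r * r))
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        Ap = apply_A(p)
        alpha = rs / float(np.sum(p * Ap))
        x += alpha * p
        r -= alpha * Ap
        rs_new = float(np.sum(r * r))
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Round trip on a 20 x 20 region: recover x_true from b = A x_true.
rng = np.random.default_rng(0)
x_true = rng.standard_normal((20, 20))
x = conjugate_gradient(apply_A(x_true))
print(float(np.max(np.abs(x - x_true))))
```

Only a handful of arrays of the size of the region are kept in memory, whereas a
dense matrix for a 100 x 100 region would have 10^8 entries.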

In practice, for ease of image synthesis, we need to store the facial coordinate
information during face detection. By means of the Poisson fusion method, the
generated target images can be seamlessly fused into the selected baseplate group
photo, while the consistency of color, texture, and illumination in the scene is
preserved.


4 Experimental Results and Evaluations 


Experimental Settings. We carefully design three types of experiments to
evaluate the overall performance of our method: (1) single-person facial
expression beautification of a group photo; (2) single-frame image based group
photo beautification (the images are randomly crawled from the internet);
(3) multi-frame continuous images based group photo beautification (the images
are captured by our hand-held device). CelebA is used as our training dataset,
which includes 202,599 color face images and a 40-attribute binary vector for
each image. We use the aligned and cropped version and scale the images to
128 x 128. In addition, the distribution of attribute labels is highly biased.
In practice, for each attribute that needs to be edited, 1000 images from the
attribute-positive class and 1000 images from the attribute-negative class are
randomly chosen as our test set. We use all the remaining images as our training
dataset. Meanwhile, to demonstrate the advantages of our method, we randomly
collect some facial images from the internet and take some photos casually,
which also serve as our test dataset. Please refer to our supplemental document
for more results (see footnote 1).


[Figure 4 image. Column labels: Input, Cropped, ICGAN, Residual, CycleGAN, Ours.
Row label: Old.]


Fig. 4. Comparison of the mouth-smiling results produced by different methods on 
single-person faces of a group photo. 


1 https://drive.google.com/file/d/159my8s52wzL-Eq9vGtubKDegMQLfLfQq/view?usp=sharing.
Evaluations on Single-person Facial Expression Beautification of
Group Photos. Considering the detailed wrinkles on older faces, we conduct
experiments on faces of different ages in group photos. As shown
in Fig. 4, we compare our results with those produced by some state-of-the-art
methods, including ICGAN [13], learning residual images [15], and
CycleGAN [20]. We observe that the compared methods commonly depend heavily
on the training dataset; thus, their results on other test images are
not satisfactory. In sharp contrast, our results are more natural and
preserve more details. Moreover, when facing diversified and complicated
expression manipulation tasks, our approach outperforms the state-of-the-art
facial expression beautifying methods with respect to effectiveness,
versatility, and robustness.


Evaluations on Single-frame Image Based Group Photo Beautification.
In this kind of experiment, we use our generalization network to manipulate
facial attributes and further synthesize a beautiful group photo. Our network can
successfully synthesize semantically meaningful and visually plausible contents
for the key face regions that need to be beautified. As shown in Fig. 5, our method
can generate satisfying results with high perceptual quality, which shows great
promise for smart facial expression beautification during group photo capturing.


[Figure 5 panels: Result, Input]


Fig. 5. The results of our method for single-frame image based group photo
beautification.


Evaluations on Multi-frame Continuous Images Based Group Photo
Beautification. Our method can also synthesize a new satisfying group photo
from unsatisfying multi-frame continuous images. Considering the diverse poses in
multi-frame continuous group photos, we detect facial landmarks in the
generated images and in a group photo to locate a rectangular region around the
eyes/mouth, which eases the fusion of the manipulated regions. As shown in
Fig. 6, we replace poor facial expressions with the beautified ones in the
baseplate group photo based on our improved Poisson fusion strategy.


Fig. 6. The results of our method for multi-frame continuous images based group photo
beautification.


Quantitative Evaluations. To quantitatively evaluate the visual quality of the
synthesized group photos, we carry out a user study, wherein 20 people are asked
to classify randomly shuffled images as real or synthetic. Each person
is shown a random selection of 50 real images and 50 synthesized images in a
random order, and is asked to label each image as either real or synthetic.
Table 1 shows the confusion matrix, which indicates that people find it very
hard to reliably distinguish real images from our synthetic ones.

Meanwhile, we conduct a user study based on a survey of 20 participants,
wherein participants are required to assess the visual realism, image quality,
and individual detail preservation by labeling the best generated
image among the randomly shuffled images generated by different methods. Table 2
documents the results. In the vote on the best performance on attribute
manipulation, our method gains the majority of votes. It clearly shows that our
method can well accommodate photo-realistic facial expression beautification
for highly diverse group photos. In addition, as shown in Fig. 7, we further ask
participants to grade our results between 0 and 5 according to the image quality
and visual realism. It confirms that our method outperforms other approaches
on facial expression beautification.


Table 1. Visual Turing test results for distinguishing real/synthesized images. The 
average human classification accuracy is 57.25% (chance = 50%). 


          | Labeled as real | Labeled as synthetic
Real      | 557             | 443
Synthetic | 412             | 588
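
The accuracy quoted in the caption of Table 1 can be checked directly from the four counts. The short Python sketch below does this; the dictionary layout and the function name are illustrative choices and not part of the original study.

```python
# Counts copied from Table 1 (outer key: ground truth, inner key: human label).
confusion = {
    "real": {"real": 557, "synthetic": 443},
    "synthetic": {"real": 412, "synthetic": 588},
}


def classification_accuracy(matrix):
    """Fraction of judgments where the human label matches the ground truth."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total


accuracy = classification_accuracy(confusion)
print(f"{accuracy:.2%}")  # 57.25%
```

The total of 2000 judgments matches 20 participants labeling 100 images each.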


Table 2. Visual Turing test results for the facial expression manipulations
produced by different methods on group photos. The voting percentages in each column sum to 100%.


Methods  | Mouth-smiling beautification | Eyes-opening beautification
ICGAN    | 0.7%                         | -
Residual | 2.1%                         | 1.3%
CycleGAN | 37.3%                        | 45.4%
Ours     | 59.9%                        | 53.3%
[Figure 7 panels: Mouth-smiling with Single-person Images; Eyes-opening and Mouth-smiling with Group-photo Images]


Fig. 7. The subjective evaluation on different methods. 


5 Conclusion and Future Works 


This paper detailed a two-stage first-evaluation-then-beautification framework
with which we can synthesize satisfactory group photos from original single-
or multi-frame group photos that are routinely captured in daily life.
Benefiting from the novel integration of multi-label CNN and Bayesian prior
embedded GANs, our framework can generate natural and realistic images, which
helps improve the generalization ability of facial expression manipulation and
synthesis. Various qualitative and quantitative experiments were carried out to
evaluate the overall performance of our method, and all the experiments
confirmed that our method has apparent advantages over the existing techniques in
terms of efficacy, effectiveness, versatility, and robustness. Despite promising
results in most cases, the obtained results are sometimes less than ideal. For
example, it remains difficult to generate and synthesize group photos when we
only have images of low quality, or images involving facial occlusion and/or
complex body poses. Such challenging cases deserve more research effort. Besides,
in the near future we plan to explore more intrinsic temporal context priors and
how such priors could further enhance group photo beautification.
Abstract. Hierarchical systems are powerful tools to deal with non-linear data
with a high variability. We show in this paper that regressing a bounded variable
on such data is a challenging task. As an alternative, we propose here a two-step
process. First, an ensemble of ordinal classifiers assigns the observation to a given
range of the variable to predict, which yields a discrete estimate of the variable. Then, a
regressor is trained locally on this range and its neighbors and provides a finer
continuous estimate. Experiments on affect audio data from the AVEC’2014 and
AV+EC’2015 challenges show that this cascading process compares
favorably with the state of the art and with challengers’ results.


Keywords: Affective computing - Ensemble of classifiers - Random forests 


1 Introduction 


Nowadays, vocal recognition of emotions has multiple applications in domains as
diverse as medicine, telecommunications or transport. For example, in
telecommunications, it would become possible to prioritize the calls from individuals in imminent
danger over less urgent ones. In general, emotion recognition enables the
improvement of human/machine interfaces, which justifies the unexpected increase in
research in this field, driven by progress in machine learning.

Human interactions rely on multiple sources: body language, facial expressions,
etc. A vocal message carries a lot of information that we translate implicitly. This
information can be expressed or perceived verbally, but also non-verbally, through the
tone, the volume or the speed of the voice. The automatic analysis of such information
gives insights into the speaker’s emotional state.

The conceptualization of emotions is still a hot topic in psychology. Opinions do
not converge towards a unique model. In fact, we can mainly differentiate three
approaches [8]: (1) the basic emotions, like happiness, sadness, surprise, fear, anger, or
disgust, described by Ekman; (2) the circumplex model of affect; and (3) the appraisal
theory. In the second model, the affective state is generally described by at least two
dimensions: the valence, which determines the positivity of the emotion, and the arousal,
which determines the activity of the emotion [17]. These two values, bounded on [-1,
+1], describe the emotional state of an individual much more precisely than the basic
emotions. However, it has been shown that other dimensions are necessary to report
this state more accurately during an interaction [7].

The choice of one model or the other constrains the kind of machine learning
algorithms used to estimate the emotional state. In the case of basic emotions, the variable
to be predicted is qualitative and nominal, so classification methods must be used. On the
contrary, affective dimensions are quantitative, continuous, and bounded variables, so
a regression predictor is needed. To take advantage of the best of both worlds, we
propose in this study a method that combines classification and regression. To predict a
continuous and bounded variable, we first quantize the affect variable into bounded
ranges. For example, a 5-range valence quantization would give the following
boundaries: {-1, -0.5, -0.2, +0.2, +0.5, +1}. The ranges could be interpreted as “very negative”,
“negative”, “neutral”, “positive” and “very positive”. Then, we proceed in 3 steps:


• Train an ensemble of classifiers to estimate if the affect value associated to an
observation is higher than a given boundary;

• Combine the ensemble decisions to predict the optimal range;

• Regress locally the variable on this range.
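
The 5-range quantization used as an example above can be sketched in a few lines of Python. This is only an illustration: the half-open convention [b_i, b_{i+1}[ for all ranges but the last, and the function and constant names, are assumptions of this sketch.

```python
from bisect import bisect_right

# Boundaries and labels of the 5-range valence quantization given in the text.
BOUNDARIES = [-1.0, -0.5, -0.2, 0.2, 0.5, 1.0]
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]


def quantize(value, boundaries=BOUNDARIES):
    """Return the 0-based index of the range that contains value."""
    if not boundaries[0] <= value <= boundaries[-1]:
        raise ValueError("value lies outside the bounded interval")
    # bisect_right counts the boundaries <= value; the result is clamped so
    # that the upper bound itself falls in the last range.
    index = bisect_right(boundaries, value) - 1
    return min(index, len(boundaries) - 2)


print(LABELS[quantize(0.15)])  # neutral
print(LABELS[quantize(-0.7)])  # very negative
```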


The proposed method is therefore a cascade of ordinal classifiers and local regressors
(COCLR). We will see in the following state of the art that similar proposals have been
made. In this paper, however, we perform a thorough study of the key parameter of this
method: the number of ranges to be separated by the ensemble of ordinal classifiers.
We show experimentally that:


• On small and numerous ranges, ordinal classification performs correctly;

• On large ranges, the COCLR cascade performs better;

• On challenging databases (AVEC’2014 [20] and AV+EC’2015 [16], described in
Sect. 4), the COCLR cascade can be compared favorably to challengers’ and
winner’s proposals with an acceptable development and computational cost.


This paper is organized as follows. Section 2 focuses on the state of the art in affect
prediction on audio data. In Sect. 3, we present the COCLR flowchart. In Sect. 4,
we introduce the datasets used to train and evaluate our system and the
pre-processing applied. Then, in Sect. 5, we present and discuss our results.
Finally, Sect. 6 offers some conclusions.


2 State of the Art 


The Audio-Visual Emotion recognition Challenge (AVEC), which has taken place every
year since 2011, makes it possible to assess the proposed systems on similar datasets. The main
objective of these challenges is to ensure a fair comparison between research teams by
using the same data. In particular, the unlabeled test set is released to registered
participants some days before the challenge deadline. Moreover, the organizers provide
the competitors with a set of audio and video descriptors extracted by approved methods.

Prosodic features such as the pitch, the intensity, the speech rate, and the
quality of the voice are important to identify the different types of emotions. Low-level
acoustic descriptors like energy, spectrum, cepstral coefficients, formants, etc. enable
an accurate description of the signal.


2.1 Emotion Classification and Prediction 


The classification of emotion is done through classical methods like support vector
machines (SVM) [1], Gaussian mixture models (GMM) [18] or random forests
(RF) [14]. For the regression task, numerous models have been proposed: support vector
regressors (SVR) [4], deep belief networks (DBN) [12], bidirectional long short-term
memory networks (BLSTM) [13], etc. As all these models have their own pros and
cons, recent works focus on model combinations to improve overall accuracy. Thus, in
[10], the authors propose to associate BLSTM and SVR to benefit from the handling of the
past/present context by the BLSTM and the generalization ability of the SVR.

The AV+EC’2015 challenge winners proposed in [11] a hierarchy of BLSTMs. They
deal with 4 information channels: audio, video (described by frame-by-frame geometric
features and temporal appearance features), electrocardiogram and electrodermal
activity. They combine the predictions of single-modal deep BLSTMs with a
multi-modal deep BLSTM that performs the final affect prediction.


2.2 Ordinal Classification and Hierarchical Prediction 


The standard approach to ordinal classification converts the class value into a numeric
quantity and applies a regression learner to the transformed data, translating the output
back into a discrete class value in a post-processing step [6]. Here, we work directly on
numerical values of affect variables but quantize them into several ranges. Recently, a
discrete classification of continuous affective variables through generative adversarial
networks (GAN) has been proposed [2]. Five ranges are considered.

The idea of combining regressors and classifiers has already been applied to
age estimation from images. In [9], a first “global” regression is done with an SVR
on all ages. Then, it is refined by locally adjusting the regressed age value using an
SVM. In [19], the authors propose another hierarchy for the same problem. They define 3 age
ranges (namely “child”, “teen” and “adult”). An image is classified by combining the
results of a pool of classifiers (SVC, FLD, PLS, NN and naive Bayes) with a majority
rule. Then, a second stage uses the appropriate relevance vector machine regression
model (trained on one age range) to estimate the age.

The idea of such a hierarchy is not new, but its application to affect data has not
been proposed yet. Moreover, we show in the following experiments that the number of
boundaries to be considered impacts the performance of the whole hierarchy.


3 Cascade of Ordinal Classifiers and Local Regressors 


The cascade of ordinal classifiers and local regressors proposed here is a hybrid
combination of classification and regression systems. Let us denote by X the observation (feature
vector), by y the affective variable to be predicted (valence or arousal) and by ŷ the prediction.
The variable y is continuous and defined on the bounded interval [-1; +1]. Therefore, it
is possible to segment this interval into a set of n smaller sub-intervals, called “ranges” in
the following, bounded by the boundaries b_i and b_{i+1} with i ∈ {1, ..., n + 1}. For example,
n = 2 defines 2 ranges, [-1; 0[ (“negative”) and [0; +1] (“positive”), and 3 boundaries
b_i ∈ {-1, 0, +1}. Each boundary b_i (except -1 and +1) defines a binary classification
problem: given the observation X, is the prediction ŷ lower (resp. higher) than b_i? By
combining the outputs of the (n - 1) binary classifiers, we get an ordinal classification.
Given the observation X, the prediction ŷ is probably (necessarily in case of perfect
classification) located within the range [b_i, b_{i+1}]. Once this range is obtained, a local
regression is run on it along with its direct neighbors to predict y. Figure 1 illustrates the full
cascade. The structure of this system is modular and compatible with any kind of
classification and regression algorithms. Moreover, it is generic and may be adapted to
subjects other than affective dimension prediction.


[Figure 1 diagram. Stage labels: Ordinal classifier (tests “y > b_i?” on the observation X), Combination, Local regression (regressors R_i), Regression range (R_i)]


Fig. 1. COCLR: a two-stage cascade. The first stage is a combination of binary classifiers whose
aim is to estimate y’s range. The observation X is handled by the corresponding local regressor,
which will evaluate the value of y on this range and its neighbors.
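
The modular structure of the cascade can be illustrated by a minimal Python sketch in which any callable may play the role of a binary classifier or of a local regressor. The function name, the rule used to derive the range index (counting positive decisions) and the final clamping to the bounded interval are assumptions of this sketch, not details given in the paper; the lambda functions are hypothetical stand-ins for trained models.

```python
def coclr_predict(x, boundaries, classifiers, regressors):
    """Two-stage prediction: ordinal classification, then local regression.

    classifiers[k](x) returns +1 if y is judged higher than the inner boundary
    boundaries[k + 1], and -1 otherwise (n - 1 classifiers for n ranges).
    regressors[i](x) returns a continuous estimate for range i.
    """
    decisions = [classifier(x) for classifier in classifiers]
    # Stage 1: with consistent decisions, the number of positive outputs is
    # the 0-based index of the selected range.
    range_index = sum(1 for decision in decisions if decision > 0)
    # Stage 2: delegate to the regressor attached to that range.
    estimate = regressors[range_index](x)
    # Keep the estimate inside the bounded interval (assumption of the sketch).
    return max(boundaries[0], min(boundaries[-1], estimate))


# Toy setting with n = 2 ranges ("negative" and "positive").
boundaries = [-1.0, 0.0, 1.0]
classifiers = [lambda x: 1 if x[0] > 0 else -1]
regressors = [lambda x: -0.4, lambda x: 0.6]

print(coclr_predict([2.0], boundaries, classifiers, regressors))   # 0.6
print(coclr_predict([-2.0], boundaries, classifiers, regressors))  # -0.4
```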


3.1 Ordinal Classification 


The regression of an affect value y on an observation X can be bounded by the
minimum and the maximum values it might take. If y is not originally bounded, we
bound it by the minimal and maximal values of the studied dataset. The interval on
which y is defined, J = [min(y), max(y)], can be divided into n ranges.

The first stage of the cascade is an ensemble of (n - 1) binary classifiers. Each
classifier decides if, given the observation X, the variable to be predicted is higher than
the lower boundary b_i of a range or not. Training samples are labeled -1 if their y value
is lower than b_i and +1 otherwise. Considering the sorted nature of the boundaries b_i,
we build here an ensemble of ordinal classifiers [6].
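
The labeling rule for the training samples can be sketched as follows. The helper names and the toy values are illustrative; only the rule itself (-1 below the boundary, +1 otherwise, one binary problem per inner boundary) comes from the text.

```python
def ordinal_labels(y_values, boundary):
    """Label each sample -1 if its y value is lower than boundary, +1 otherwise."""
    return [-1 if y < boundary else 1 for y in y_values]


def ordinal_training_sets(y_values, boundaries):
    """Build one label vector per inner boundary (the two extremes are trivial)."""
    return {b: ordinal_labels(y_values, b) for b in boundaries[1:-1]}


y_train = [-0.6, -0.1, 0.15, 0.3, 0.8]    # illustrative affect values
boundaries = [-1.0, -0.3, 0.0, 0.3, 1.0]  # n = 4 ranges, 3 inner boundaries

for boundary, labels in ordinal_training_sets(y_train, boundaries).items():
    print(boundary, labels)
# -0.3 [-1, 1, 1, 1, 1]
# 0.0 [-1, -1, 1, 1, 1]
# 0.3 [-1, -1, -1, 1, 1]
```

Each label vector can then be used to train one binary classifier of the ensemble.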

We combine the decisions of these classifiers to compute the lower and upper
bounds of the optimal range [b_i, b_{i+1}]. Consider an observation X with y = 0.15.
Suppose the number of ranges is n = 6, with linearly distributed boundaries b_i. The
ranges are delimited by the following boundaries: [-1.0, -0.5, -0.25, 0, 0.25, 0.5, 1.0]. In case of perfect
classification, the output vector of the ensemble of classifiers will be {1, 1, 1, 1, -1,
-1}, where -1 means “y is lower than b_i” while +1 means “y is higher than b_i”.
Here, b_i is the boundary associated with the last classifier with a positive output and
b_{i+1} the one associated with the first classifier with a negative output. By combining the local decisions of these
binary classifiers, we get the (optimal) range [b_i, b_{i+1}]. This range C_i will be used in the
second stage to locally predict y. In the example, this range is [0, 0.25]. However,
indecision between two classifiers can happen [15]. This indecision will be handled by
the second stage of the cascade.
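
The worked example above (y = 0.15, n = 6) can be reproduced with a small Python sketch. Following the six-element output vector of the example, one decision is attached to the lower boundary of each range. The function name and the fallback used when no classifier gives a positive output are assumptions of this sketch.

```python
def combine_decisions(outputs, boundaries):
    """Return (b_i, b_{i+1}) from the +1/-1 outputs of the ordinal classifiers.

    outputs[k] is the decision attached to the lower boundary boundaries[k].
    """
    if len(outputs) != len(boundaries) - 1:
        raise ValueError("expected one decision per range lower boundary")
    positives = [k for k, output in enumerate(outputs) if output > 0]
    # The last classifier with a positive output gives the lower bound; if no
    # classifier fired, fall back to the first range (assumption of the sketch).
    i = positives[-1] if positives else 0
    return boundaries[i], boundaries[i + 1]


boundaries = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]
outputs = [1, 1, 1, 1, -1, -1]  # perfect classification for y = 0.15

print(combine_decisions(outputs, boundaries))  # (0.0, 0.25)
```

With inconsistent outputs (a positive decision following a negative one), this rule simply trusts the last positive classifier; the text leaves such indecisions to the second stage.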

The performance measure of the ordinal classifiers, the accuracy, is directly linked
to the definition of the ranges. The choice of the number of ranges n is a key parameter
of our system and can be seen as a hyper-parameter. The n ranges and their
corresponding boundaries b_i can be defined in several ways. If they are linearly distributed,
they will define a kind of affective scale as in [2]. But the choice of the boundaries b_i
could also prevent strong imbalances between classes. In case of highly imbalanced
classes, the application of a data augmentation method is strongly recommended [3].

From now on, we can evaluate the accuracy (range detection rate) of the classifier
combination. It can also be used to compute a discrete estimate of y, using for ŷ the center
of the predicted range. Finally, we can estimate the correlation of ŷ to the ground truth y.
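
Both quantities mentioned above, the discrete estimate and the range detection rate, are simple to compute. The following Python sketch uses hypothetical predicted and true ranges purely for illustration.

```python
def discrete_estimate(predicted_range):
    """Use the center of the predicted range as the discrete estimate of y."""
    lower, upper = predicted_range
    return (lower + upper) / 2.0


def range_detection_rate(true_ranges, predicted_ranges):
    """Accuracy of the first stage: share of correctly predicted ranges."""
    hits = sum(t == p for t, p in zip(true_ranges, predicted_ranges))
    return hits / len(true_ranges)


# Hypothetical first-stage outputs for three observations.
predicted = [(0.0, 0.25), (-0.5, -0.25), (0.25, 0.5)]
actual = [(0.0, 0.25), (-0.25, 0.0), (0.25, 0.5)]

print([discrete_estimate(r) for r in predicted])  # [0.125, -0.375, 0.375]
print(range_detection_rate(actual, predicted))
```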


3.2 Local Regression 


The aim of the second stage of the cascade is to compute the continuous value of
y. Thus, each range i is associated with a regressor R_i that locally regresses y on [b_i; b_{i+1}].
So, each regressor is specialized in the regression on a specific range. However, as
explained previously, indecisions between nearby classes throughout the ordinal
classification may induce an improper prediction of the range. As a result, the wrong
regressor can be activated, causing a drop in the correlation. The analysis of the first
stage results, illustrated by the confusion matrix (Fig. 2), indicates that prediction
mistakes are close to, or even adjacent to, the optimal range to which y belongs.
Thus, we can expand the regression range to [b_{i-1}; b_{i+2}], where these boundaries exist.


Fig. 2. Confusion matrix of the first stage of the cascade

Widening the local regression ranges helps to solve the indecision issue between
nearby boundaries. Moreover, it frees us from the obligation to strongly optimize
the first stage. In fact, using a perfect classifier instead of a classifier that reaches an
accuracy of 90% on the first stage would not substantially change the result of the whole cascade.
Moreover, the second stage local regression produces a continuous estimate ŷ of y and
it is possible to compute a correlation between both variables.
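
The widening of the regression range to a range and its direct neighbors can be sketched as below. The 0-based range index, the function names and the toy samples are assumptions of this illustration; the selected subset would then be used to train the local regressor of range i.

```python
def widened_range(i, boundaries):
    """Bounds of range i (0-based) extended to its direct neighbors.

    Corresponds to [b_{i-1}, b_{i+2}] in the text, truncated at the ends of
    the bounded interval when a neighbor does not exist.
    """
    n_ranges = len(boundaries) - 1
    if not 0 <= i < n_ranges:
        raise IndexError("range index out of bounds")
    low = boundaries[max(i - 1, 0)]
    high = boundaries[min(i + 2, n_ranges)]
    return low, high


def local_training_subset(samples, i, boundaries):
    """Keep the (x, y) pairs whose target y falls inside the widened range."""
    low, high = widened_range(i, boundaries)
    return [(x, y) for x, y in samples if low <= y <= high]


boundaries = [-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0]
print(widened_range(3, boundaries))  # (-0.25, 0.5)
print(widened_range(0, boundaries))  # (-1.0, -0.25)

samples = [("a", -0.8), ("b", -0.1), ("c", 0.15), ("d", 0.45), ("e", 0.9)]
print(local_training_subset(samples, 3, boundaries))
# [('b', -0.1), ('c', 0.15), ('d', 0.45)]
```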


4 Databases 


4.1 AVEC’2014 


The AVEC’2014 database is a collection of audio and video recordings of human/
machine interaction [20]. This base is composed of 150 recordings, each of them
containing the reactions of only one person, collected from 84 German subjects. In order
to create this dataset, some of the subjects were recorded several times, with a break
of two weeks between recording sessions. The distribution of the recordings is
as follows: 18 subjects were recorded three times, 31 of them were
recorded twice and the remaining 34 were recorded only once. The recordings
are then split into 3 parts: learning set, validation set, and test set. We used the generic audio
features provided by the organizers [21].


4.2 AV+EC’2015/RECOLA 


The second dataset we used to measure the performance of our system is that of the affect
recognition challenge AV+EC’2015 [16]. The AV+EC’2015 relies on the RECOLA
base. This one is composed of a set of 9.5 h of audio, video and physiological recordings
(ECG, EDA) from 46 recordings of French-speaking people with different origins (Italian, German,
and French) and different genders. The AV+EC’2015 relies on a subset of 27
completely labelled recordings. In our case, we only used the audio recordings and only
worked on the valence, which is the most complex affect to be predicted in this
challenge. The learning, development and testing partitions contain 9 recordings each.
The diversity of origins and genders of the subjects has been preserved in these. The
different audio features used are available in the AV+EC’2015 presentation paper [16].


5 Experimental Results 


5.1 Performance Metrics 


The performance of the cascade is directly linked to that of both stages. The
performance of the ensemble of ordinal classifiers is measured by the accuracy, which
is the ratio of examples for which the interval has been correctly predicted. We
use the confusion matrix in order to analyze the behavior of this system in a more
precise way.

The performance of the ensemble of local regressors is measured using Pearson’s
correlation (PC), the gold standard metric of the AVEC’2014 challenge [20] on which we
base our work. However, as these data are not normally distributed, we decided to
measure the performance of our system with Spearman’s correlation (SC) and the
concordance correlation coefficient (CCC) as well.
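
For reference, the three metrics can be written with the Python standard library alone. This is a sketch of the textbook definitions (Pearson’s correlation, Spearman’s correlation as Pearson’s correlation on average ranks, and the concordance correlation coefficient with population variances), not the evaluation code used by the challenges; the toy values are illustrative.

```python
from statistics import fmean


def pearson(x, y):
    """Pearson's correlation coefficient."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            result[order[k]] = mean_rank
        start = end + 1
    return result


def spearman(x, y):
    """Spearman's correlation: Pearson's correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


def ccc(x, y):
    """Concordance correlation coefficient (penalizes bias and scale shifts)."""
    mx, my = fmean(x), fmean(y)
    n = len(x)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    var_x = sum((a - mx) ** 2 for a in x) / n
    var_y = sum((b - my) ** 2 for b in y) / n
    return 2.0 * cov / (var_x + var_y + (mx - my) ** 2)


truth = [-0.2, 0.0, 0.1, 0.3]
shifted = [value + 0.5 for value in truth]  # perfectly correlated but biased
print(pearson(truth, shifted), spearman(truth, shifted), ccc(truth, shifted))
```

In this toy case, the Pearson and Spearman correlations are (numerically) 1, while the CCC is much lower because it also accounts for the constant offset between prediction and ground truth.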

The experimental results presented in the following are computed on the 
development/validation set of the different databases. 


5.2 Used Systems and Baseline 


The study of the valence on a bounded interval allows the identification of several
intensity levels of the felt emotion. For example, we can qualify it as negative,
neutral, or positive, depending on this value. However, for the AVEC’2014 and the
AV+EC’2015 bases, these intensity levels are not equally represented: more than 60%
of the data belong to the range [-0.1; 0.1] and 80% to the range [-0.2; 0.2].
Considering the fact that some systems poorly support strong class imbalances, we
increased the volume of our data using the Synthetic Minority Over-sampling Technique [3].

As previously stated, our architecture is modular and can be adapted to any kind of
classification or regression method. Throughout our experiments, we tried to use
support vector machines (C-SVM with RBF kernels) and random forests (RF with 300
decision trees, attribute bagging on √nfeatures) as classifiers. Table 1 presents the
ordinal classification rate obtained by these two systems on the development sets of
AVEC’2014 and AV+EC’2015, for the prediction of valence. We chose this affect
variable because it is known to be particularly hard to predict. By taking the centers of
the predicted intervals as values of ŷ, we were able to compute the correlations of
these two systems. These correlations make it possible to compare the performance of our
classifier ensemble to that of a single “global” random forest regressor dealing with
the whole interval [-1; +1].


Table 1. Valence prediction: comparison of different ordinal classifiers (SVM-OC and RF-OC)
and one global random forest regressor (RF-GR) on a subset of the training set. The performance
measure is the Pearson correlation coefficient.


         | AVEC’2014 | AV+EC’2015
Baseline | 0.38      | 0.17
RF-GR    | 0.45      | -
SVM-OC   | 0.61      | 0.56
RF-OC    | 0.77      | 0.65


The results obtained on both databases encouraged us to continue with random
forests rather than support vector machines. Indeed, the results returned by the former are
significantly sharper than the SVM ones, independently of the choice of the
sub-intervals. For the same reasons, we have decided to use an ensemble of random forests
to perform local regression.


5.3 Results on AVEC’2014 


Table 2 compares the performance of the different systems presented on the
development base of AVEC’2014, for several numbers of ranges n.

First, the interval J has been split into 10 ranges: [-1.0; -0.4; -0.3; -0.2; -0.1;
0.0; 0.1; 0.2; 0.3; 0.4; 1.0]. The best-performing system in terms of correlation is here,
without a doubt, the ordinal classifier ensemble, where the values are the centers of the
predicted ranges. It is also relevant to point out that, despite the very high correlation
of the local regressors alone, the COCLR system does not seem efficient.

Then, the interval J has been split into 6 ranges: [-1.0; -0.4; -0.2; 0.0; 0.2; 0.4;
1.0]. The best-performing system, as far as the correlation is concerned, is still the
ordinal classifier ensemble. However, the performance gap between the COCLR and
the ordinal classifier ensemble has narrowed. It is also noteworthy that the accuracy of
the classification system has risen and the correlation of the local regressors alone has
slightly dropped.

Finally, the interval J has been split into n = 4 ranges: [-1.0; -0.3; 0.0; 0.3;
1.0]. The previous conclusions on ordinal classifiers and local regressors still hold.
But this time, the COCLR cascade turns out to be significantly the most efficient one.
The correlation obtained by this system is the highest obtained over all the choices of
intervals. These different results highlight the importance of the choice of
the number of ranges on which the COCLR system relies.


Table 2. Valence prediction: impact of the number of ranges on the performance of the global regressor
(GR), ordinal classifier (OC), local regressors (LR) and cascade (COCLR). LR performance is
computed considering the classification as “perfect” (Accuracy = 1).


n  | Model | Accuracy | Pearson C | Spearman C | CCC
1  | GR    | -        | 0.45      | 0.47       | 0.27
10 | OC    | 0.78     | 0.69      | 0.70       | 0.60
   | LR    | -        | 0.91      | 0.90       | 0.89
   | COCLR | -        | 0.51      | 0.53       | 0.37
6  | OC    | 0.83     | 0.63      | 0.66       | 0.54
   | LR    | -        | 0.85      | 0.85       | 0.76
   | COCLR | -        | 0.54      | 0.53       | 0.39
4  | OC    | 0.89     | 0.47      | 0.48       | 0.29
   | LR    | -        | 0.80      | 0.81       | 0.77
   | COCLR | -        | 0.77      | 0.77       | 0.65


5.4 Results on AV+EC’2015/RECOLA 


As we did previously, we measured the performance of our system according to the
different sub-intervals. The affect value varies within [-0.3; 1], so we discard the classifiers and
regressors trained on [-1; -0.3]. Table 3 presents a summary of these results.
Throughout our tests, we used 3 different groups of sub-intervals. The largest,
composed of 8 ranges, is: [-0.3; -0.2; -0.1; 0.0; 0.1; 0.2; 0.3; 0.4; 1.0]. The second one,
composed of 5 ranges, is: [-0.3; -0.1; 0.0; 0.1; 0.3; 1.0]. Finally, the last one,
composed of 3 ranges, is: [-0.3; 0.0; 0.3; 1.0].

We can observe in Table 3 that the results on the RECOLA database are similar to
those on AVEC’2014. In fact, the best-performing system remains the COCLR
when we choose a small number of ranges. The correlation obtained by the cascade of
ordinal classifiers and local regressors for the valence on the development base reaches
0.67. As previously, we observe a decline in the correlation of the local
regressors and a rise in the accuracy of the first stage of the cascade when the size of
the sub-intervals increases. Comparisons with the challenge winner’s results [11] are
encouraging. Though our cascade gets lower results (0.675) than their multimodal
system (0.725), it gets better results than those obtained on the audio channel only
(0.529). The latter are similar to those of the first stage ordinal classifier (0.521).

Last but not least, our proposal is fast to train (<10 min for 3 ranges) and to evaluate
(<0.1 ms) on an Intel Core i7 (8 cores, 3.4 GHz) and does not require much memory
(<1 GB for 3 ranges).


Table 3. Valence prediction: best obtained models for each number of ranges on the AV+EC’2015
development set. Challenge results [11] on the audio channel (AC) and their multimodal system
(MM). The performance measure is the Pearson correlation coefficient.


           | Proposal | Proposal | Proposal | Baseline AC | Winner AC | Winner MM
N          | 1        | 5        | 3        | -           | -         | -
Best model | GR       | OC       | COCLR    | -           | -         | -
Pearson C  | 0.463    | 0.521    | 0.675    | 0.167       | 0.529     | 0.725


6 Conclusions 


We propose in this article an original approach for the regression of a continuous,
bounded variable, based on a cascade of ordinal classifiers and local regressors. We
chose to apply it to the estimation of affective variables such as the valence. The
first stage allows us to predict a trend, depending on the chosen intervals. Thus, taking
into account, for example, four intervals, the emotional state of a person will be
qualified as very negative, negative, positive or very positive. We have been able to
observe that this trend is more accurately estimated as the number of intervals
increases. The second stage enables a sharper prediction of the variable by regressing
locally on its interval and its direct neighbors. It seems even more efficient when the
number of considered intervals is low. Indeed, it reduces the influence of the
first stage on the prediction. Finally, we showed that the performance of this cascade
compares favorably with that of the winner of the AV+EC’2015 challenge.
Despite these satisfying results, there is still room for improvement (beyond
applying it to the prediction of the arousal and the ongoing assessment of the
performance on the challenges’ test data). The COCLR is a cascade whose first stage is
an ensemble of classifiers. The decision here is limited by the weakest
classifier. A better-suited combination rule would improve the overall
performance. The outputs (binary or probabilistic) of the ordinal classifiers might also
enrich the descriptors used by the local regressors.
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Abstract. Gender information has been widely used to improve the per- 
formance of speech emotion recognition (SER) due to different expressing 
styles of men and women. However, conventional methods cannot ade- 
quately utilize gender information by simply representing gender char- 
acteristics with a fixed unique integer or one-hot encoding. In order to 
emphasize the gender factors for SER, we propose two types of features 
for our framework, namely distributed-gender feature and gender-driven 
feature. The distributed-gender feature is constructed in a way to repre- 
sent the gender distribution as well as individual differences, while the 
gender-driven feature is extracted from acoustic signals through a deep 
neural network (DNN). Each of the two proposed features is then appended
to the original spectrogram to serve as the input for the
following decision-making network, a hybrid network that
combines a convolutional neural network (CNN) and a bi-directional long
short-term memory (BLSTM). Compared with the spectrogram alone, adding
the distributed-gender feature and the gender-driven feature in the gender-aware
CNN-BLSTM improved unweighted accuracy by a relative error reduction
of 14.04% and 45.74%, respectively.


Keywords: Speech emotion recognition · Gender information · DNN ·
CNN · BLSTM


1 Introduction 


It is believed that SER can significantly improve the quality of spoken dialogue 
systems. Although SER has been studied for many years, machines still have 
difficulties in recognizing speakers’ emotions. 

In many studies [1,2], gender differences are observed in emotional speech 
expression, suggesting that gender information will bring certain advantages in 
SER. Ways of incorporating gender information into SER can be summarized 
into two methods. The first is to create a separate emotion model for each gender, 
© Springer Nature Switzerland AG 2018
which is referred to as Sep-System for the separate model [3]. The second is to
take gender information as an augmented feature vector, which is referred to as 
Aug-System [4]. Both methods can utilize the gender information of speakers 
and thus improve the accuracy of SER. 

In Sep-System, gender information need not be represented and the problem
can be decomposed into gender identification followed by emotion recognition.
However, only utterances of the corresponding gender can be used to train the male
and female emotion classifiers individually. Moreover, treating gender separately
adds the error of gender recognition and requires more time for emotion
recognition. Regarding Aug-System, all the utterances in a training set are used to
train the emotion classifier, but it is difficult to represent gender
information. Conventional methods of encoding gender employ a unique integer or
one-hot encoding. Because the gender ID is assigned a nominal value, using fixed and
simple numerical values to encode gender may not make adequate use of gender
information.

In this work, we present a novel approach to deal with these challenges. Aug- 
System is chosen to address the problem of insufficient training data. To rep- 
resent gender information more properly, the distributed-gender feature is pro- 
posed to describe the distribution of male and female speakers. The distributed- 
gender feature is a set of random values, where male speakers are distributed 
between 0 and 0.5 and female speakers are distributed between 0.5 and 1. The 
distributed-gender feature is different in each utterance and reflects individual 
differences. In order to utilize acoustic information and real individual differences 
of humans, we propose the gender-driven feature, a gender-conscious bottleneck 
feature that is extracted from acoustic features using DNN. The gender-driven 
feature not only distinguishes between men and women, but also retains
discriminative acoustic information. Then, each of the two features is appended to
the original spectrogram. Finally, the gender-aware CNN-BLSTM model is
used to extract hierarchical features and distinguish emotions. To the best of our
knowledge, our work is the first to make use of gender information, acoustic 
information and original spectrographic features simultaneously for SER. 

The outline of this paper is as follows: related work is presented in Sect. 2. 
Our proposed gender-aware CNN-BLSTM is introduced in Sect. 3. Sections 4 and 
5 cover the experiments, conclusion and future work. 


2 Related Work 


Gender information has been widely used for the SER task. The authors of [5] used
gender information to improve the accuracy of SER. They showed that the
combined gender and emotion recognition system performed better than a
gender-independent emotion recognition system.

In [6], additional speaker-related information such as speaker identity, gender 
and age were used on Sep-System and Aug-System. However, adding gender 
information resulted only in a slight improvement. 

Methods for SER have achieved strong results using the acoustic features provided
by the INTERSPEECH 2009 Emotion Challenge [7]. The 384-dimensional acoustic
features consist of 32-dimensional low-level descriptors (LLDs) and their
statistical values, which are described in Table 1.

Table 1. Acoustic features of INTERSPEECH emotion challenge 2009

LLDs (16 × 2)           | Statistical values (12)
(Δ)ZCR, (Δ)RMS energy   | Mean; standard deviation; extremes: min/max value; range;
(Δ)F0, (Δ)HNR           | relative min/max position; kurtosis; skewness;
(Δ)MFCC (1-12)          | linear regression: offset; slope; MSE
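
The arithmetic behind the 384 dimensions (16 LLDs, their 16 deltas, and 12 statistical values each) can be sketched as follows. This is only an illustration under stated assumptions: the delta is taken as a simple first-order difference, the function names are hypothetical, and the challenge's own feature extractor may define individual statistics differently.

```python
import math

def functionals(contour):
    """Return the 12 statistical values of Table 1 for one LLD contour."""
    n = len(contour)
    mean = sum(contour) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in contour) / n)
    # Standardized third and fourth moments (0.0 for a constant contour).
    skew = sum((v - mean) ** 3 for v in contour) / (n * std ** 3) if std else 0.0
    kurt = sum((v - mean) ** 4 for v in contour) / (n * std ** 4) if std else 0.0
    vmin, vmax = min(contour), max(contour)
    # Least-squares line y = offset + slope * t over the frame index t.
    t_mean = (n - 1) / 2
    t_var = sum((t - t_mean) ** 2 for t in range(n))
    slope = (sum((t - t_mean) * (v - mean) for t, v in enumerate(contour)) / t_var
             if t_var else 0.0)
    offset = mean - slope * t_mean
    mse = sum((offset + slope * t - v) ** 2 for t, v in enumerate(contour)) / n
    return [mean, std, vmin, vmax, vmax - vmin,
            contour.index(vmin) / n, contour.index(vmax) / n,
            kurt, skew, offset, slope, mse]

def utterance_features(llds):
    """llds: 16 LLD contours of equal length; returns 16 * 2 * 12 = 384 values."""
    # Assumption: delta = first-order difference, padded with 0.0 to keep the length.
    deltas = [[b - a for a, b in zip(c, c[1:])] + [0.0] for c in llds]
    return [f for contour in llds + deltas for f in functionals(contour)]
```

Calling `utterance_features` on 16 contours returns one fixed-length vector per utterance, regardless of the utterance duration.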

In recent years, deep networks based on spectrograms [8,9] have improved the
accuracy of speech recognition. The authors of [10-12] employed CNN-BLSTM to process
spectrograms and showed significant improvements in SER. In the present work,
we follow this successful structure to perform emotion recognition.


3 Gender-Aware CNN-BLSTM 


3.1 Distributed-Gender Feature 


Although basic emotions are shared between cultures and nationalities [13],
different speakers express their emotions in different ways. In order to reflect
individual differences, random variables are added to a fixed male or female
template. As a result, even speakers of the same gender receive slightly different
values. Finally, the distributed-gender feature of males ranges from 0 to 0.5,
while that of females ranges from 0.5 to 1.
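
A minimal sketch of how such a feature could be generated is given below. The uniform distribution, the function name, and the default dimension of 32 (the size used in Sect. 4.1) are assumptions; the paper only states the value ranges and that a random component is added to a gender template.

```python
import random

def distributed_gender_feature(gender, dim=32, rng=random):
    """Draw a dim-dimensional feature: males in [0, 0.5], females in [0.5, 1]."""
    if gender not in ("male", "female"):
        raise ValueError("gender must be 'male' or 'female'")
    low = 0.0 if gender == "male" else 0.5
    # Assumption: values are uniform inside the half-interval of the gender.
    return [rng.uniform(low, low + 0.5) for _ in range(dim)]

# A new vector is drawn for every utterance, so two utterances of the same
# gender share the interval but not the exact values.
rng = random.Random(0)
male_vec = distributed_gender_feature("male", rng=rng)
female_vec = distributed_gender_feature("female", rng=rng)
```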


3.2 Gender-Driven Feature 


Acoustic information can be used to classify male and female speakers. However,
acoustic features are interrelated and have small inter-class distances [14]. In this
study, a DNN is used to transform high-dimensional acoustic features into a
gender-driven feature that represents gender information discriminatively.


Visualization of Gender-Driven Feature. In this section, the gender-driven
feature is described graphically. Figures 1 and 2 show the feature space of the
acoustic features and gender-driven feature, respectively. We use PCA to
reduce the acoustic features and the gender-driven feature to two dimensions
individually. The abscissa and ordinate in these two figures represent the first and
second components of PCA, respectively.

In the right panel of Fig. 2, male and female data are clearly separated
by the gender-driven feature. Moreover, the boundaries between different emotions
in the gender-driven feature are sharper than those in the acoustic features.
Comparing Figs. 1 and 2, the gender-driven feature not only reflects gender
information, but also retains acoustic information that is useful for SER.
Therefore, the gender-driven feature is conjectured to be more effective for SER,
which will be supported by experiments described in Sect. 4.

Fig. 1. Feature space of the acoustic features. The left panel shows the distribution of
seven emotions. The right panel shows the distribution of male and female speakers.


Fig. 2. Feature space of the gender-driven feature. The left panel shows the distribution 
of seven emotions. The right panel shows the distribution of male and female. 


Gender-Driven Feature for SER. In this study, there are two reasons to
add the gender-driven feature to the spectrographic data. The first is that adding
gender information can help improve the accuracy of SER. The gender-driven
feature encodes male and female data better with variable values. The second
reason is that a wide-band spectrogram emphasizes formants but not F0, whereas
F0 is the main vocal cue for emotion recognition [15]. Since the gender-driven
feature is extracted from the acoustic features, it still retains some acoustic
information (e.g. F0) that is complementary to the spectrogram.

Figure 3 depicts the structure of our proposed method. In the feature preparation
and fusion stages, the DNN extracts the 32-dimensional gender-driven feature from
the 384-dimensional acoustic features. Then, the spectrogram and the gender-driven
feature are combined into a compositional feature (F). The compositional feature
vector of the j-th segment in the i-th utterance can be formulated as:

$F_{ij} = [S_{ij} \; GDF_{ij}]$, (1)

where $S_{ij}$ and $GDF_{ij}$ correspond to the spectrogram vector and gender-driven
feature vector of the j-th segment in the i-th utterance, respectively.

Fig. 3. The gender-aware CNN-BLSTM with the gender-driven feature


4 Experiments 


4.1 Experimental Setup 


Speech signals are chosen from the Berlin Emotion Speech Database [16]. The
database has seven categorical emotion types including disgust, sadness, fear, 
happiness, neutral, boredom and anger, where the numbers of utterances in each 
category are 46, 62, 69, 71, 79, 81 and 127, respectively. The dataset consists of 
535 simulated emotional utterances in German. There are 233 male utterances 
and 302 female utterances. 

Procedures in this study are as follows. All the trials are based on a
CNN-BLSTM model. The CNN is applied first to extract hierarchical features from the
original spectrogram, because it models temporal and spectral local correlations [17].
BLSTM layers are added to capture sequential dynamics in consecutive
utterances [18]. The CNN has two convolutional layers and two max-pooling layers.
The first convolutional layer has 32 filters of size 5 × 5, and the second
convolutional layer has 64 filters of size 5 × 5. Both pooling layers have size
2 × 2. After the flatten layer, a fully connected layer with 1024 units is used. The
BLSTM has two hidden layers, each of which has 256 units. In our experiment,
utterances are split into segments with a 265 ms window size and a 25 ms shift
length. Because of the limited size of the Berlin Emotion Database, 10-fold
cross-validation is used in the following trials.
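
The segmentation step can be illustrated with the short sketch below. The 16 kHz sampling rate and the choice to keep only full windows are assumptions; the text gives only the window size and the shift length.

```python
def segment_bounds(num_samples, sample_rate=16000, window_ms=265, shift_ms=25):
    """Return (start, end) sample indices of every full window in an utterance."""
    window = int(sample_rate * window_ms / 1000)   # 4240 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)     # 400 samples at 16 kHz
    # Assumption: a trailing piece shorter than one window is discarded.
    return [(start, start + window)
            for start in range(0, num_samples - window + 1, shift)]

bounds = segment_bounds(16000)   # a 1-s utterance
print(len(bounds))               # 30 overlapping segments
```

Each segment is then transformed into a spectrogram. With the 256 FFT points mentioned for the baseline, every frame has 256 / 2 + 1 = 129 frequency bins, which matches the spectrogram width listed in Table 2.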


• Spectrogram: This is the baseline model, where only one emotion recognizer
is created for all speakers. The short-time Fourier transform (STFT) is used to
transform segmental signals into amplitude spectrograms. The STFT uses
256 FFT points.

• Spectrogram (Sep-System): In contrast to the above trial, we create male
and female emotion classifiers separately.

• Spectrogram + one-hot gender feature: This is a straightforward way to add
2-dimensional gender information to the spectrogram. Male is represented as
“0 1”, while female is “1 0”.

• Spectrogram + fixed gender feature: The ground truth of gender information
is encoded as a fixed male template or female template. Then, the spectrogram is
augmented with the fixed gender feature.

• Spectrogram + distributed-gender feature (Proposed): The 32-dimensional
distributed-gender feature is described in Sect. 3.1.

• Spectrogram + LLDs: The 32-dimensional LLDs described in Table 1 are added
to the spectrogram for SER.

• Spectrogram + gender-driven feature (Proposed): The details of this
method are shown in Fig. 3. The DNN contains three layers. There
are 32 units in the bottleneck layer and 1024 units in the other hidden layers.
The input of the DNN is the acoustic features, and the teacher signal is the gender labels.
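
The augmentation of Eq. (1) that these trials share can be sketched as below. Table 2 lists the input sizes 26 × 129, 26 × 131 and 26 × 161, which is consistent with appending the 2- or 32-dimensional vector to each of the 26 frames of a segment; this frame-wise tiling is inferred from those sizes rather than stated in the text, and the function name is hypothetical.

```python
def compose(segment_spectrogram, extra_feature):
    """Append the same feature vector to every frame of one spectrogram segment."""
    return [list(frame) + list(extra_feature) for frame in segment_spectrogram]

segment = [[0.0] * 129 for _ in range(26)]   # 26 frames x 129 frequency bins
one_hot_male = [0, 1]                        # "0 1" encodes male
gender_driven = [0.5] * 32                   # placeholder 32-dimensional feature

with_one_hot = compose(segment, one_hot_male)        # 26 x 131
with_gender_driven = compose(segment, gender_driven) # 26 x 161
```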


4.2 Evaluation Results 


Table 2 shows the results of the trials described in the previous section. From Table 2,
we conclude: (1) The spectrogram (Sep-System) performs worse than the baseline.
This may be because less training data is available to train the male and female
emotion classifiers separately. (2) Because the one-hot gender feature is
small, adding it to the spectrogram yields only a slight improvement in WA over
the baseline (86.92% vs. 86.73%), while UA is slightly lower (86.24% vs. 86.40%).
(3) Using the distributed-gender feature and the gender-driven
feature in the gender-aware CNN-BLSTM outperforms the baseline by 14.04%
and 45.74% relative error reduction in UA, respectively. In addition, the
gender-driven feature performs better than the distributed-gender
feature. The reason is that the gender-driven feature represents gender
characteristics, real individual differences and acoustic information, while the
distributed-gender feature only reflects gender information. (4) Using the
distributed-gender feature or the gender-driven feature performs better than using
fixed values. The reason is that the variable features can handle gender information
in a more appropriate way. (5) Improvements are shown when LLDs are added to the
spectrogram. The result reveals that the spectrogram and acoustic information are
complementary. (6) The gender-driven feature performs better than LLDs. This
result is evidence that the gender-driven feature not only provides gender
information but also retains discriminative acoustic information of emotions.

Table 2. Weighted accuracy (WA) and unweighted accuracy (UA) of different features
for SER. WA refers to the accuracy of all test utterances. UA is defined as the average of
per emotional category recall. F1 is the harmonic average of precision and recall.

System | Features                                     | Size     | WA     | UA
Aug    | Spectrogram (Baseline)                       | 26 × 129 | 86.73% | 86.40%
Sep    | Spectrogram                                  | 26 × 129 | 86.17% | 85.46%
Aug    | Spectrogram + one-hot gender feature         | 26 × 131 | 86.92% | 86.24%
Aug    | Spectrogram + fixed gender feature           | 26 × 161 | 88.22% | 87.65%
Aug    | Spectrogram + distributed-gender feature     | 26 × 161 | 88.97% | 88.31%
Aug    | Spectrogram + LLDs                           | 26 × 161 | 91.21% | 90.76%
Aug    | Spectrogram + gender-driven feature          | 26 × 161 | 92.71% | 92.62%

Fig. 4. F1 results of different features on different emotions
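
The relative error reductions quoted in item (3) can be re-derived from the UA column of Table 2. The sketch below treats the error rate as 100% minus the accuracy; the function name is illustrative.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction of the error rate (100 - accuracy), in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err

# UA values from Table 2: baseline 86.40%, distributed-gender 88.31%,
# gender-driven 92.62%.
print(round(relative_error_reduction(86.40, 88.31), 2))  # 14.04
print(round(relative_error_reduction(86.40, 92.62), 2))  # 45.74
```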
Figure 4 shows the contribution of different features to identifying different types
of emotions. This figure reveals the following: (1) Although the spectrogram
(Sep-System) has less training data than the baseline, it still performs better on
the fear and neutral emotions. (2) Using the distributed-gender feature to
represent gender performs better than using the one-hot gender and fixed gender
features on disgust, happiness, anger, boredom and neutral. (3) Adding the
gender-driven feature to the spectrogram gives the best results on most emotions,
except for anger. Conversely, adding the LLDs to the spectrogram achieves the
best performance on anger. The reason may be that, after the DNN processing,
the gender-driven feature is more effective at distinguishing gender characteristics.
Although the gender-driven feature still keeps acoustic information for classifying
the anger emotion, its contribution is smaller than that of the LLDs. Overall, both
the distributed-gender feature and gender-driven feature are effective for SER.
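
For reference, the three metrics used in Table 2 and Fig. 4 can be computed as in the sketch below, which follows the definitions in the caption of Table 2. The labels in the usage lines are made-up toy data, not results from the experiments.

```python
def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all test utterances that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    """UA: recall averaged over the emotion categories present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        total = sum(t == c for t in y_true)
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

def f1_score(y_true, y_pred, c):
    """F1 of category c: harmonic mean of precision and recall."""
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / sum(p == c for p in y_pred)
    recall = tp / sum(t == c for t in y_true)
    return 2 * precision * recall / (precision + recall)

# Toy example with an imbalanced test set.
y_true = ["anger"] * 4 + ["fear"] * 2
y_pred = ["anger"] * 4 + ["anger", "fear"]
wa = weighted_accuracy(y_true, y_pred)      # 5 of 6 utterances correct
ua = unweighted_accuracy(y_true, y_pred)    # mean of recalls 1.0 and 0.5
```

On imbalanced data such as the Berlin database (46 to 127 utterances per emotion), UA weights every emotion equally, which is why it is reported next to WA.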


5 Conclusions and Future Work 


In this paper, the gender-aware CNN-BLSTM was proposed for speech emotion 
recognition. We first proposed the distributed-gender feature and gender-driven 
feature. Then, the two novel features with gender information were individually
appended to the spectrogram as additional variables. Finally, the CNN-BLSTM
was used to conduct the final classification. The evaluation results indicated
that our proposed features take adequate advantage of gender information
and perform better on the SER task. For future work, multi-modal features including
textual and visual features will be considered for constructing an SER system.

Abstract. Recognizing emotional traits in speech is a challenging task
that has become very popular in recent years, especially due to the recent
advances in deep neural networks. Although very successful, these models
inherited a common problem from strongly supervised deep neural
networks: a large number of strongly labeled samples is necessary
for the model to learn a general emotion representation. This paper
proposes a solution to this problem: a semi-supervised
neural network that learns speech representations from
unlabeled samples and uses them in different speech emotion recognition
scenarios. We provide experiments with different datasets,
representing natural and controlled scenarios. Our results show that our
model is competitive with state-of-the-art solutions in all these scenarios
while sharing the same learned representations, which were learned
without the need for strongly labeled data.


Keywords: Emotion recognition · Semi-supervised learning · GAN ·
Speech representation · Deep learning


1 Introduction 


Recent advances in deep learning have increased the popularity and robustness
of speech emotion recognition [14,20,23]. Such models usually make
use of a large number of labeled samples to learn general representations for
emotion recognition, providing state-of-the-art results in different speech-related
scenarios [2,8,21].

However, supervised deep learning needs a lot of labeled training data. 
Another problem with the current supervised deep learning models lies in the 
nature of emotion description itself. Different persons can express and perceive 
the same emotion in many ways, which causes a lack of agreement about how to 
annotate samples from different scenarios [7]. One solution for this is the use of
an even larger number of labeled samples to represent different emotional states 
into a general emotion categorization. 

Unsupervised learning is useful for solving this problem since
it does not require labeled data to learn a general speech representation, which
can be transferred to emotions. To work around the same problem, recent works such as
[1,15,16,19] apply semi-supervised training to deep neural networks for image
classification in domains where labeled data is scarce.

If we train a deep neural model with a dataset from a given domain, the
model will specialize in that scenario. A model trained to generate
a general representation of the data, however, will be capable of representing
the audio in every presented scenario. To be general enough, deep learning
models for speech emotion recognition usually rely on a large number of labeled
samples. This happens because (1) deep neural models need a large number
of samples to learn descriptors that are robust enough to generalize over the domain
where they are applied, and (2) strongly supervised training produces a fast and
more focused change in the gradient directions, which usually leads to better
fine-tuning of the descriptors and separation boundaries for classification.

We propose a hybrid neural network composed of an adversarial autoencoder,
which learns general speech representations, and a strongly
supervised model, which uses them as input to classify emotion expressions in
speech in different scenarios. In the first step, the model learns how to represent
the audio through an unsupervised training process. This representation is the input
for the second step, where the model learns the separation boundaries and distribution
between classes through a supervised learning process. In the unsupervised step,
a Generative Adversarial Network (GAN) trains an autoencoder that is
responsible for learning how to represent the speech present in the audio. As a GAN
is trained without supervision, the model can use unlabeled and non-emotional
data, which makes it possible to use the trained model in different
scenarios. After training, the encoder filters can extract prosodic characteristics of
the input speech without the need for supervised labels. The encoder ends up
learning representations based on the data distribution. The second module of
the proposed model uses these prosodic characteristics learned by the encoder as
low-level feature representations and, now using a strongly supervised solution,
is trained to classify emotions in speech. The classifier is also composed of a set
of different filters. These filters are fine-tuned and learn high-level
abstractions of the input signal, which are pertinent to that specific domain.

We make use of an unconstrained and unlabeled corpus to learn general
speech representations, which are shared among all our emotion recognition
scenarios. Our specific classifiers are fine-tuned to the high-level characteristics
of each emotion recognition scenario. This reduces the training effort and increases
the applicability of the model to different emotion recognition scenarios.

So, in the emotion recognition task, using a general speech representation
trained in an unsupervised manner improves the application performance and
also yields a model that adapts to other scenarios, since the speech
representation is not tied to the scenario of the evaluated dataset. The main


Semi-supervised Model for Emotion Recognition in Speech 793 


contribution of this proposal is the general speech representation model. This
model can fit different emotion recognition scenarios and different datasets
without retraining. In other emotion recognition works, the audio
representation ends up tied to the scenario of the training dataset. In our
proposal, the audio representation is more robust, being able to represent
different domains, situations, and languages.

We evaluate the performance of our model in three different scenarios
(indoor, outdoor and cross-language) and compare it with state-of-the-art
solutions. We show that our model learned a general speech representation which
is shared among all these scenarios, and that the different specific filters learn
high-level abstractions which are unique to each of these scenarios. For that, we use
three different datasets: the Surrey Audio-Visual Expressed Emotion Dataset
(SAVEE) [13], which represents a controlled environment, usually found in indoor
scenarios or simple interactions; the OMG Emotion Dataset [3], which represents
an in-the-wild, outdoor, unrestricted scenario; and finally the Berlin Database of
Emotional Speech (EmoDB) [6], which evaluates how well the representations
learned from speech signals in one language can be transferred to another
language for emotion recognition. This way, we can demonstrate the universal aspect
of emotion recognition, and that our fine-tuning step learns to correlate the
emotional aspects of the general speech representation, ignoring the information
which is not necessary for this task.


2 Proposed Model 


In this work, we propose a semi-supervised model for emotion recognition. The
model contains two modules: the first one is the general speech representation
and the second one is the classifier model. Figure 1 illustrates the
model.

The first module of our network is trained in an unsupervised
way. It is composed of an autoencoder trained by a GAN. We use
the encoder of the autoencoder model to learn the general speech
representation. The GAN was chosen because it has an autoencoder in its structure
and allows adversarial training with a large amount of unlabeled data. The
speech representation generated by the model is the input for the second
part of the model.

The second part is responsible for separating the classes in
emotion classification or for predicting the values in a dimensional model.
This module is trained in a supervised way. We adapt the output of the
classifier according to the task: we use either binary classification for categorical
emotions (e.g., anger, fear, happiness, etc.) or a double-head structure with one
unit per head for arousal/valence regression.


2.1 Adversarial Autoencoder 


Fig. 1. Abstraction of the classifier and prediction models


The Generative Adversarial Network (GAN) [12] has had a significant impact on
data generation, mainly of images, but also in audio applications, for example
for melody generation [22] and noise cleaning [18]. The basic idea of a GAN is
to conduct unsupervised adversarial training of two artificial neural networks,
a discriminator model (D) and a generator model (G). The training process
resembles a minimax two-player game, in which G captures the data
distribution, and D estimates the probability that a sample is real rather than
generated by G. The G training procedure is to maximize the probability of D
making a mistake.
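
For reference, the two-player game described above is usually written as the following value function, as given in the original GAN formulation [12]. Here $p_{data}$ denotes the distribution of the real samples and $p_z$ the noise prior fed to the generator.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

D is trained to make $D(x)$ large for real samples and $D(G(z))$ small for generated ones, while G is trained to make $D(G(z))$ large.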

The Boundary Equilibrium Generative Adversarial Network (BEGAN) [5]
is a GAN variation whose distinguishing feature is the use of an autoencoder as the
discriminator. Other particularities of BEGAN are the loss derived from the
Wasserstein distance, the addition of a variable γ to balance GAN training, and the
addition of a new metric called $M_{global}$.

In the basic GAN proposed by Goodfellow et al. [12], G
and D are trained in an adversarial way. Figure 2 presents the training
procedure of a GAN. In this figure, x represents the real samples and z is the
generated noise that is the input of the generator, which produces the generated
samples. The training of D has two different moments. In the first one, the
inputs are real samples and the expected output is the real class
(i.e., class 1). In the second one, the inputs are samples generated
by the generator from the noise and the expected output is a fake classification
(i.e., class 0). For the generator, the flow is as follows: the generator receives
noise as input and produces a fake sample, which is fed to the discriminator. The
objective is to make this sample be confused with a real sample, so the expected
output is a real classification by the discriminator. In this step the discriminator
is frozen, and only the generator is trained.
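
The targets used in the two training moments can be made concrete with a binary cross-entropy loss, as in the sketch below. It only evaluates the losses for given discriminator outputs; the networks themselves and the gradient updates are omitted, and the function names are illustrative.

```python
import math

def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def discriminator_loss(d_real, d_fake):
    """D step: real samples are labelled 1, generated samples 0."""
    losses = [bce(p, 1) for p in d_real] + [bce(p, 0) for p in d_fake]
    return sum(losses) / len(losses)

def generator_loss(d_fake):
    """G step (D frozen): generated samples should be classified as real (1)."""
    return sum(bce(p, 1) for p in d_fake) / len(d_fake)

# D outputs on a small batch: confident on real data, fooled by one fake sample.
d_loss = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.7])
g_loss = generator_loss(d_fake=[0.1, 0.7])
```

The same generated samples appear in both losses with opposite targets, which is what makes the training adversarial.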

Fig. 2. Training representation of discriminator (D) and generator (G)


A characteristic of BEGAN is the application of a balance paired with a loss
derived from the Wasserstein distance to the autoencoder training [5]. In the
training step, BEGAN has a balancing factor, defined by a variable γ, with
a range of 0 to 1. This variable penalizes D training, slowing it down. Since G
training is more difficult and slower than D training, this penalty balances the two,
thus increasing the performance of the GAN [5].
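
A minimal sketch of this balancing mechanism is shown below, following the update rules given in the BEGAN paper [5]: the discriminator weighs the loss on generated samples by an auxiliary variable k, and k is adjusted so that the ratio between the two reconstruction losses approaches γ. The step size `lambda_k = 0.001` and the function name are assumptions; γ = 0.7 is the value reported in Sect. 3.3.

```python
def began_step(loss_real, loss_fake, k, gamma=0.7, lambda_k=0.001):
    """One bookkeeping step of BEGAN given the two autoencoder losses.

    loss_real: reconstruction loss L(x) of D on real samples
    loss_fake: reconstruction loss L(G(z)) of D on generated samples
    k:         current balance variable, kept in [0, 1]
    """
    d_loss = loss_real - k * loss_fake        # objective minimized by D
    g_loss = loss_fake                        # objective minimized by G
    balance = gamma * loss_real - loss_fake
    k_next = min(max(k + lambda_k * balance, 0.0), 1.0)
    m_global = loss_real + abs(balance)       # global convergence measure
    return d_loss, g_loss, k_next, m_global

# k starts at 0, so D initially ignores the generated samples.
d_loss, g_loss, k, m_global = began_step(loss_real=1.0, loss_fake=0.5, k=0.0)
```

A decreasing `m_global` during training indicates that the reconstruction of real samples improves while the two losses stay near the target ratio.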

GANs were recently used in semi-supervised learning for image classification
tasks [1,15,16,19] and were shown to be more effective than strongly supervised
classification. This happens because unsupervised training makes it
possible for the model to learn general representations of the domain, while the
supervised fine-tuning specializes the model to solve the specific tasks. We
chose to use a variation of an adversarial autoencoder, the BEGAN [5], because
it presented better results than common GANs in learning general
representations.


2.2 Supervised Classifiers 


The supervised module of the proposed model varies according to the emotion 
recognition scenario. But the basic structure is: it receives as input the speech 
representation obtained by the unsupervised module, then it applies convolu- 
tional layers and a softmax classifier which is adapted depending on the scenario. 

We optimize the hyper-parameters of the supervised module for each task.
For that, we use the Hyperas [4] framework, which is specialized in optimizing
search spaces with real, discrete and conditional dimensions.


2.3 Semi-supervised Learning 


The adversarial autoencoder is pre-trained on a database with a larger
amount of data. Once the autoencoder is trained, it does not need to be retrained,
and the same autoencoder can be reused in other applications, also without the
need for retraining.

The supervised model is trained during the semi-supervised model
training process. In this process, we freeze the previously trained encoder layers
and train only the supervised module. The layers of the autoencoder are frozen
because, if they were also trained with the evaluated dataset, they would lose their
nature of a general speech representation and specialize in the dataset scenario.
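
The effect of freezing can be illustrated with a deliberately tiny, framework-free example. It is a toy sketch, not the model of this paper: the "encoder" is a single fixed weight standing in for the pretrained layers, the supervised module is a single trainable weight, and all names and values are hypothetical.

```python
def encode(x, w_enc):
    """Stand-in for the pretrained encoder: a fixed linear map."""
    return w_enc * x

def train_head(data, w_enc, w_head=0.0, lr=0.05, epochs=200):
    """Fit y ~ w_head * encode(x) by SGD; w_enc is never updated (frozen)."""
    for _ in range(epochs):
        for x, y in data:
            h = encode(x, w_enc)
            error = w_head * h - y
            w_head -= lr * 2 * error * h      # gradient w.r.t. the head only
    return w_head

data = [(x, 6.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]
w_enc = 2.0                                   # "pretrained" value, kept fixed
w_head = train_head(data, w_enc)              # converges towards 3.0
```

Because only `w_head` receives gradient updates, the representation produced by `encode` stays identical across tasks, which is the property the frozen encoder is meant to preserve.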


3 Experimental Methodology 


3.1 Datasets 


We use one dataset to train the unsupervised part of our model, and three to
evaluate the whole model in different scenarios. The LibriSpeech [17] dataset is
one of the largest audio datasets available, and we use it to train the
unsupervised module. We use this dataset because its amount of data and variability
of speakers and scenarios make it suitable for generating a general representation of
speech. LibriSpeech is a dataset with approximately 1000 h of English speech.

We use three other datasets, which are emotional, multimodal
and multispeaker. Each one has different characteristics and scenarios, which
enables different analyses. Therefore, we evaluated our model in indoor,
outdoor and cross-language scenarios, with the SAVEE [13], OMG Emotion [3] and
EmoDB [6] datasets, respectively.


SAVEE. We used the Surrey Audio-Visual Expressed Emotion Dataset
(SAVEE) [13] in our experiments. SAVEE is an emotional audiovisual dataset,
which consists of recordings of four male actors speaking phrases in 7 different
emotional intonations based on the Universal Emotions [11] with the addition
of the neutral emotion, where the speaker does not express any of the six universal
emotions.

This work uses only the audio modality of the dataset, which has 480
utterances in total. The SAVEE database is balanced, recorded in a controlled and
noise-free environment and only has male voices. It is therefore considered a simple
dataset and is used as the starting point of the experiments.


OMG Emotion Dataset. The One-Minute Gradual-Emotional Behavior
dataset (OMG-Emotion) [3] is the database from the One-Minute
Gradual-Emotion Behavior Challenge, which takes place at IJCNN 2018. The dataset
contains 567 unique videos totaling 7371 clips, each clip consisting of a single
utterance. Each video has a different number of utterances, with an average
duration of 8 s per utterance and an average total video duration close to 1 min.

The dataset has dimensional and categorical labels. The categorical labels are seven
emotions, based on the Universal Emotions [11] with the addition of the neutral
emotion. The continuous dimensional labels are arousal and
valence, with values in a range between -1 and 1. The OMG Emotion dataset is
complex, given its variability of speakers, scenarios, dialogs and video durations.
The dataset labels are both categorical and dimensional, which makes it possible to
verify the performance of the proposed model in different emotion recognition tasks.

EmoDB. The Berlin Database of Emotional Speech (EmoDB) [6] is an
emotional speech database recorded in German. It contains about 500 utterances
spoken by the actors in a happy, angry, anxious, fearful, bored and disgusted
manner, as well as in a neutral version. It has utterances from 10 different actors
and ten different texts. We used the EmoDB dataset to verify whether the proposed
model can also generalize emotional characteristics to other languages.


3.2 Preprocessing 


Our first preprocessing step was to resample the audio to 16 kHz. Then each
audio track was split into non-overlapping 1-s chunks. After that, the raw
audio was converted to a spectrogram via the Short-Time Fourier Transform, with
an FFT size of 1024 and a length of 512.
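
The preprocessing steps can be illustrated with a minimal, standard-library
sketch. It assumes the signal is already resampled to 16 kHz, that the "length
of 512" is the hop between frames, and that a Hann window is used. None of
these choices is confirmed by the text, and the function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE = 16_000  # Hz, from the text
N_FFT = 1024          # FFT size, from the text
HOP = 512             # assumption: the "length of 512" is the hop between frames


def split_chunks(samples, chunk_len=SAMPLE_RATE):
    """Split a signal into non-overlapping 1-s chunks, dropping a trailing partial chunk."""
    n_full = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_full)]


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram(chunk, n_fft=N_FFT, hop=HOP):
    """Magnitude STFT with a Hann window (assumed); returns frames x (n_fft // 2 + 1) bins."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(chunk) - n_fft + 1, hop):
        segment = [chunk[start + n] * window[n] for n in range(n_fft)]
        frames.append([abs(c) for c in fft(segment)[: n_fft // 2 + 1]])
    return frames


# Toy input: 2.5 s of a 1 kHz tone, so the trailing half second is dropped.
signal = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(40_000)]
chunks = split_chunks(signal)
spec = spectrogram(chunks[0])
```

Under these assumptions each 1-s chunk yields 30 frames of 513 frequency bins.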


3.3 Experiments Setup 


To evaluate our model on the SAVEE dataset, we train the BEGAN with part of the
LibriSpeech dataset and evaluate the emotion classification model on the SAVEE
dataset. To allow comparison with other work, this experiment follows the same
protocol as Ashwin et al. [2], who classify emotions present in audio and video
with a novel hybrid SVM-RBM classifier. We compare only with their audio
module. Ashwin et al. perform a speaker-dependent experiment, in which the set
of each speaker (DC, JE, JK, KL) is used separately for training and testing.
The division of the data is approximately 60% for training and 40% for testing.

Experiments were also carried out with the OMG Emotion dataset, whose
categorical and dimensional labels allow the evaluation of two emotion
recognition tasks: the classification of static emotion and the prediction of
the dimensional values arousal and valence. For all experiments with the OMG
Emotion dataset, the BEGAN is trained with part of the LibriSpeech dataset, and
the training/testing split follows the distribution made available with the
database itself.

The experiments performed on the EmoDB dataset follow the Leave-One-Speaker-Out
(LOSO) protocol, to allow comparison with other works that follow the same
protocol. In this experiment, we train the BEGAN with part of the LibriSpeech
dataset, recorded in English, and evaluate the model on the EmoDB dataset,
which was recorded in German.
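
As an illustration of the LOSO protocol, the following sketch builds one fold
per speaker, holding out all utterances of that speaker for testing. The tuple
layout `(speaker_id, features, label)` and the toy data are assumptions made
for this example only.

```python
def leave_one_speaker_out(samples):
    """Yield (held_out_speaker, train, test) once per speaker.

    `samples` is a list of (speaker_id, features, label) tuples; this layout
    is an assumption made for the sketch.
    """
    speakers = sorted({s[0] for s in samples})
    for held_out in speakers:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test


# Toy data: 3 hypothetical speakers with 2 utterances each.
toy = [(spk, [0.0], "neutral") for spk in ("s1", "s2", "s3") for _ in range(2)]
folds = list(leave_one_speaker_out(toy))
```

With the 10 EmoDB actors this procedure would produce 10 folds, and the
reported accuracy would typically be the average over them.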

We train the algorithms in each experiment for 100 epochs, with a batch size of
16. The discriminator and the generator of the BEGAN use the Adam optimizer
with a learning rate of 0.00005. The BEGAN also has a gamma hyperparameter,
which balances the generator and the discriminator, set to 0.7.
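
The role of gamma can be sketched with the standard BEGAN balance update, in
which a control variable k is adjusted so that the reconstruction loss on
generated samples tracks gamma times the loss on real samples. The step size
`LAMBDA_K` below is the default from the original BEGAN formulation and is an
assumption here, since the text does not report it. The function name is
hypothetical.

```python
GAMMA = 0.7       # balance value reported in the text
LAMBDA_K = 0.001  # assumption: default step size of the BEGAN formulation


def began_step(loss_real, loss_fake, k, gamma=GAMMA, lambda_k=LAMBDA_K):
    """Return (d_loss, g_loss, new_k, convergence) for one BEGAN update.

    loss_real and loss_fake are the autoencoder reconstruction losses of the
    discriminator on real and generated samples.
    """
    d_loss = loss_real - k * loss_fake
    g_loss = loss_fake
    balance = gamma * loss_real - loss_fake
    new_k = min(max(k + lambda_k * balance, 0.0), 1.0)  # k is kept in [0, 1]
    convergence = loss_real + abs(balance)              # global convergence measure
    return d_loss, g_loss, new_k, convergence


d_loss, g_loss, k, m = began_step(loss_real=1.0, loss_fake=0.5, k=0.0)
```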


4 Results and Discussion 


Table 1 shows the average accuracies achieved over ten executions of the model
and the best results obtained in the work of Ashwin et al. [2]. The results
obtained with this proposal are higher than those of the related work. The
standard deviation is small, which indicates that the model is stable.


Table 1. Comparison between the accuracy (%) averages


                  | DC            | JE            | JK            | KL
Ashwin et al. [2] | 79            | 78            | 76            | 80
This work         | 80.69 (±2.96) | 80.96 (±3.41) | 80.15 (±1.85) | 82.46 (±2.70)


Table 2 presents the summary of the executions of the model when tested with
the OMG Emotion Dataset, the baseline results [3], and the best result obtained
in the challenge in the audio modality¹. The table reports the F-score of the
classifier and the CCC of the predicted arousal and valence values. The F-score
obtained with the classifier model was higher than the baseline, with the
advantage that a general speech representation was used, which can be reused on
other datasets without re-training. The CCC obtained in our experiments is
lower than the best result obtained in the challenge, but better than the
baseline. We obtained this result with the same model as in the classification
experiments, without any specific treatment for this task.
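
For reference, the CCC (Concordance Correlation Coefficient) used for the
arousal and valence predictions can be computed as in the sketch below. It uses
the standard definition with population (biased) variances. The function name
and toy values are illustrative only.

```python
def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)


perfect = ccc([-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0])  # identical sequences
shifted = ccc([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])    # same shape, constant offset
```

Unlike the Pearson correlation, the CCC penalises a constant offset between
predictions and labels, which is why the shifted example scores below 1.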


Table 2. Results with the OMG emotion dataset


                      | F-score | Arousal CCC | Valence CCC
Barros et al. [3]     | 0.39    | 0.07        | 0.04
OMG emotion challenge | -       | 0.29        | 0.36
This work             | 0.73    | 0.17        | 0.16


Table 3 presents the results from the executions with the EmoDB dataset and
the comparison with other works that use the same experimentation protocol.
As can be seen, the accuracy of our proposal is below that of the related
works. However, considering that our model learns how to represent the
emotional data from another language, the results are still relevant.

The BEGAN used in our experiments is trained with the LibriSpeech database only
once. After the model is saved, it can be used in different experiments without
retraining. This is one of the principal advantages of the proposed approach:
once trained, the model can be used for different databases and several tasks.


1 https://www2.informatik.uni-hamburg.de/wtm/OMG-EmotionChallenge/.


Table 3. Results with EmoDB dataset


                      | Accuracy
Deb and Dandapat [10] | 83.80%
Deb and Dandapat [9]  | 85.10%
This work             | 72%


5 Conclusion 


This work proposes a new semi-supervised model for emotion recognition tasks.
The use of this algorithm can help overcome one of the common challenges of the
emotion recognition field, which is speech representation.

We propose a general speech representation model, which is built with a GAN,
trained in an unsupervised way, and then incorporated into the task models,
thus building the semi-supervised model. From a set of experiments with
different datasets and the same algorithm, it was possible to verify that the
use of a GAN can help in the training of an emotion recognizer which, besides
needing a smaller amount of training data in the supervised part, also achieves
superior performance and provides a more stable algorithm.

In this work, experiments were performed with the SAVEE dataset, which is a
simple dataset, and with the OMG Emotion Dataset, which is a complex database,
given the variability of its speakers and scenarios, and which has categorical
and dimensional labels. In the experiments, it was possible to verify that the
proposed model is superior to the baseline, and to confirm the benefit of using
a speech representation model that can be reused in other models and with other
databases.

An experiment with a dataset in another language was also performed. The speech
representation module was trained with an English-language dataset, and the
emotion classification was performed on a German dataset. The results were
compared with related works used as baselines. This experiment indicates that
the unsupervised model represents the emotional characteristics of speech
independently of the language.

As a continuation of this work, sets of experiments will be carried out in
which the BEGAN is trained with different datasets and the semi-supervised
learning model is evaluated on datasets from different domains (e.g., a dataset
with only children's voices, a dataset in other languages, etc.).


Abstract. Convolutional Neural Networks (CNNs) have exhibited certain
human-like performance on computer vision related tasks. Over the past few
years, they have outperformed conventional algorithms in a range of image
processing problems. However, utilising a CNN model with millions of free
parameters on a resource-limited embedded system is a challenging problem. The
Intel Neural Compute Stick (NCS) provides a possible route for running
large-scale neural networks on a low cost, low power, portable unit. In this
paper, we propose a CNN based Raspberry Pi system that can run a pre-trained
inference model in real time with an average power consumption of 6.2 W. The
system uses the Intel Movidius NCS, which avoids the need for expensive
processing units such as GPUs or FPGAs. The system is demonstrated using a
facial image-based emotion recogniser. A fine-tuned CNN model is designed and
trained to perform inference on each captured frame within the processing
modules of the NCS.


Keywords: CNN · Embedded system · Low power system · SWaP profile


1 Introduction 


The Size, Weight and Power (SWaP) profile is an important factor in many
applications of real-time embedded systems. However, it is difficult to
incorporate the benefits of deep learning (DL) in real-time embedded systems
due to their limited computation capability and power. One solution is to use
cloud computing [1]. This paper shows the first comparison between typical DL
hardware and an edge device, which is applicable to any DL model that does not
require online learning, with the aim of showing how the NCS can help bridge
the gap between the two. The next phase of Internet of Things (IoT) development
will be adding intelligence to the devices. This will not only allow each
device to share more in-depth information, but will also require less
information to be sent off the device, which provides a greater level of
security. These devices have typically been unable to run DL models due to the
amount of processing required, which we show is no longer the case.

The remainder of the paper is organised as follows. Section 2 introduces the
background of DL on embedded systems. Section 3 describes the proposed
Raspberry Pi NCS system, with details on the system configuration, working with
a very simple self-designed CNN trained on a public emotion recognition
dataset. The runtime speed and power consumption of the system are presented in
Sect. 4 and compared with several DL evaluation platforms. Section 5 concludes
the paper.


2 Background 


Advances in smart devices and the Internet of Things have led to a plethora of
devices with the potential to have real-time embedded intelligence. As
traditional DL research concentrated on GPUs, the focus was on computational
runtime and accuracy rather than power use [2]. However, low powered edge
devices are an important area of research.

Nvidia to date has arguably the most popular embedded DL devices with the
Jetson range. The Jetson TK1, TX1 and TX2 have featured in up to 1,000 research
publications (Google Scholar search) in a broad range of applications [3-5].
Benchmarks against PCs have shown that there are good use cases for these
devices, with power savings of 15-30 times and a reduction in throughput by
one-tenth, giving a net increase of 5-15 times [6].

Other chip manufacturers have introduced their own versions of low powered
embedded DL accelerators, such as Qualcomm with their Snapdragon Neural
Processing Engine and Intel with both their Nervana Neural Network Processor
and the device featured in this paper, the Movidius Myriad Visual Processing
Unit (in the form of the Neural Compute Stick). Pena et al. [7] presented the
first benchmarking results of the NCS, Raspberry Pi 3, and the Intel Joule 570x
development board. All tests measured the time taken and the power used to
complete one pass of the network, with the NCS results averaged over both
boards. The work also showed how the different systems handled networks of
differing complexity.

Figure 1 gives an illustration of the designed system. The NCS, webcam and
power bank are connected via USB, while the touchscreen is connected via the
GPIO pins to the Raspberry Pi. The actual system, showing how all the
components are connected and how the UI appears on the screen, is shown on the
right of Fig. 1.


Fig. 1. The Raspberry Pi - NCS system.


3 The System


As shown in Fig. 1, a Raspberry Pi 3 Model B development board with 40 GPIO
pins and four USB 2.0 ports is used as the main processing unit in the proposed
system. The board is powered by a 5 V/2 A mobile DC power bank to make the
whole system fully portable. The Raspberry Pi contains a 1.2 GHz 64-bit
quad-core ARM Cortex-A53 CPU with 1 GB of RAM, running the standard Raspbian
Jessie desktop OS, which supports the Python programming environment required
for the Intel NCS SDK to run in API mode. A Logitech C270 HD (720p) webcam is
connected to one USB 2.0 port to provide the video input. A 3.5" TFT
touchscreen provides general user interaction and visualises the online
processing results. The Intel NCS is used exclusively for the neural network
model, with the information being transferred over the USB interface.


3.1 The CNN for Emotes 


The architecture of the network had to be a reduced version of state-of-the-art
architectures, as the inference models that are runnable on the NCS cannot
resolve unknown placeholders/variables. Very often these placeholders are
employed for training-specific parameters but are not necessary for NCS
inferencing. Before compiling the NCS model, the TensorFlow model is trained to
generate three saved model files: an index file for model indexing, a data file
holding the network parameters, and a meta file which contains the network
structure. These files are then used with the shrunken version of the network
to generate another set of inference-only network models. The original network
is reduced by removing the dropout layers and the training-specific code, which
usually contains: reading/importing data, the loss function and accuracy
computation definitions, placeholders other than the input tensor of the
network, etc. The names of the input and output layers must always be set so
that the compiler can easily identify them in the structure of the network.


Fig. 2. The designed CNN architecture for facial expression recognition; the
stride of the convolution kernel is set to 1 and the stride of the pooling
kernel is set to 2.


The facial emotion recogniser network illustrated in Fig. 2 contains a total of
6 layers: 2 convolutional, 2 pooling, and 2 fully-connected. The Rectified
Linear Unit (ReLU) activation function [8] is employed for each layer. The CNN
inputs are grayscale 48 × 48 images, followed by 2 convolutional layers with 32
and 64 filters respectively, both of size 3 × 3. The convolution operation in
each layer uses 1-pixel strides with "same" padding. A max-pooling layer with a
2 × 2 kernel is placed after each convolutional layer to subsample the feature
maps from the previous layer. The final stage is a fully connected dense layer
with 1024 neurons; the network output layer comprises 7 neurons, one per facial
emotion, and performs the softmax [9] calculation.
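
The layer sizes above can be checked with a short shape and parameter count.
This is a sketch that assumes the network has no layers beyond those listed,
that each layer has bias terms, and that pooling uses the stride of 2 given in
the caption of Fig. 2. The helper names are hypothetical.

```python
def conv_same(h, w, c_in, c_out, k=3):
    """k x k convolution, stride 1, 'same' padding: spatial size is unchanged."""
    params = k * k * c_in * c_out + c_out  # weights plus biases
    return (h, w, c_out), params


def max_pool(h, w, c, stride=2):
    """2 x 2 max-pooling with stride 2 halves each spatial dimension."""
    return (h // stride, w // stride, c), 0


def dense(n_in, n_out):
    """Fully connected layer: weights plus biases."""
    return n_out, n_in * n_out + n_out


layers = []
shape, p = conv_same(48, 48, 1, 32)
layers.append(("conv1", shape, p))
shape, p = max_pool(*shape)
layers.append(("pool1", shape, p))
shape, p = conv_same(*shape, 64)
layers.append(("conv2", shape, p))
shape, p = max_pool(*shape)
layers.append(("pool2", shape, p))
flat = shape[0] * shape[1] * shape[2]
n, p = dense(flat, 1024)
layers.append(("fc1", n, p))
n, p = dense(n, 7)
layers.append(("out", n, p))
total_params = sum(p for _, _, p in layers)
```

Under these assumptions the flattened feature map has 12 × 12 × 64 = 9216
values and the first dense layer holds almost all of the roughly 9.5 million
parameters.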


Training Phase. The training data comes from the FER2013 database [10]. Based
on the results of previous work, we consider that the order of the original
dataset represents an unbalanced training set, which can lead to obvious
overfitting and underfitting issues. We randomly shuffled the original dataset
together with the corresponding labels and split it into the required data
batches. The CNN was trained by backpropagation, with cross-entropy as the
target loss function and the Adam stochastic optimiser [11]. To prevent the
network from overfitting, a ridge penalty (L2 regularisation) is added to the
cross-entropy function. The dropout technique [12] is also well known as an
effective regularisation to prevent network overfitting. In the training phase,
dropout with rate 0.5 is applied to the fully connected layer.
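
The regularised loss described above can be sketched for a single sample as
follows. The coefficient `l2` is hypothetical, since the text does not report
the value used, and the function names are chosen for this example only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_cross_entropy(logits, target, weights, l2=1e-3):
    """Cross-entropy for one sample plus an L2 (ridge) penalty on the weights.

    `l2` is a hypothetical coefficient; the text does not state its value.
    """
    cross_entropy = -math.log(softmax(logits)[target])
    penalty = l2 * sum(w * w for w in weights)
    return cross_entropy + penalty


# Uniform logits over the 7 emotion classes and two toy weights.
loss = l2_cross_entropy([0.0] * 7, target=3, weights=[1.0, 2.0], l2=0.1)
```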


Result. The designed convolutional network was validated on the self-defined
(shuffled) testing set and validation set. The Extended Cohn-Kanade (CK+) [13]
dataset was also used to evaluate the actual performance of the network, using
the same model trained on the FER2013 dataset. After 100 epochs the model
converged at 90.99% testing accuracy and 87.73% validation accuracy on the
shuffled FER2013 dataset. The model achieved 70.51% test accuracy on the CK+
dataset, performing very well at recognising happy (100%), surprise (92%) and
neutral (84%). The confusion matrix on the FER2013 test set shows nearly
perfect accuracy on each emotion.


3.2 Embedded Device 


The Raspberry Pi 3 was chosen as a flexible platform for this work. As a Linux
microcomputer, it can run a multitude of programs, similar to a desktop PC. For
our problem, it needed to be able to run the TensorFlow deep learning
environment and the full Intel NCS SDK (although only the API is used, to save
space). This approach makes use of the Raspbian Stretch desktop OS together
with another 18 libraries. Along with the API, the emotion recognition system
makes significant use of two other important libraries. The first is OpenCV
[14], an open source computer vision library that is used for the Haar cascade
face detection function and to show the live emotion recognition feed on the
display. The inbuilt Haar cascade was chosen for its speed at runtime, once
optimised, compared to similar models, for example the Histogram of Gradients
(HoG) face detector of the dlib [15] library. The Haar classifier was able to
produce a robust detection of a face at 3-4× the speed of the HoG classifier.
The Haar classifier is known to be the faster and less accurate classifier, but
proves to be robust enough as the face cropper within this system. The
well-established Multi-Task Cascaded Convolutional Network (MTCNN) [16] deep
learning approach to face detection and alignment was also tested, as a recent
implementation had appeared in the NCS App Zoo on the NCS GitHub page. This
network requires a further 2 Intel NCS units to run, but would allow an
end-to-end deep learning approach to emotion recognition. However, due to the
bounding box regression stages of the network, it proved to be slower, with an
average runtime similar to that of the dlib HoG classifier.

The other important library, which ensures real-time operation, is imutils
[17]. This convenience library gives a significant speed-up in one of the
bottleneck areas, image acquisition. Compared to the OpenCV function, the
imutils function uses the multi-threading of the Raspberry Pi's quad-core
processor: a dedicated function collects each image from the webcam as soon as
it is available and stores it in a queue of images. As a result, the time taken
to acquire the next image drops from tens of milliseconds to tens of
nanoseconds.
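
The idea behind threaded acquisition can be sketched with the Python standard
library alone. The class below is a hypothetical illustration of the pattern
and is not the imutils implementation. `read_frame` stands in for a camera read
call, and a plain iterator replaces the webcam so that the example runs
anywhere.

```python
import queue
import threading


class ThreadedFrameReader:
    """Read frames on a background thread and buffer them in a queue."""

    def __init__(self, read_frame, max_frames=128):
        self._read_frame = read_frame  # callable returning a frame, or None at the end
        self._frames = queue.Queue(maxsize=max_frames)
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._worker, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _worker(self):
        while not self._stopped.is_set():
            frame = self._read_frame()
            if frame is None:  # source exhausted
                return
            # Wait for room in the queue, waking up regularly to honour stop().
            while not self._stopped.is_set():
                try:
                    self._frames.put(frame, timeout=0.1)
                    break
                except queue.Full:
                    pass

    def read(self, timeout=1.0):
        """Return the next buffered frame; raises queue.Empty on timeout."""
        return self._frames.get(timeout=timeout)

    def stop(self):
        self._stopped.set()
        self._thread.join()


# Stand-in source: five integer "frames" instead of webcam images.
source = iter(range(5))
reader = ThreadedFrameReader(lambda: next(source, None)).start()
frames = [reader.read() for _ in range(5)]
reader.stop()
```

Because the main loop only takes an already buffered frame from the queue, the
cost of waiting for the camera is moved off the processing thread.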


3.3 Intel NCS 


DL is appearing in an increasing number of mobile devices without the necessity
for cloud computing. The NCS used in this paper is from chipmaker Intel's
Movidius department and incorporates one Myriad 2 machine vision processing
unit into a small USB stick. Movidius announced that it delivers more than 100
gigaflops of performance. It can locally run neural network inference models
built with the Caffe and TensorFlow frameworks. A general development process
for an NCS based embedded system is illustrated in Fig. 3. The training process
does not need the NCS stick or SDK, only standard DNN development on a desktop
PC. Using the software SDK of the NCS, the user then profiles, tunes and
compiles the trained DNN model on the NCS and a PC that runs the x86 64-bit
Ubuntu 16.04 OS. The provided SDK can check the validity of the designed DNN
and provides an API for the Python and C languages. After that, any developer
system (e.g. a Raspberry Pi) that runs a compatible OS with the neural compute
API can accelerate neural network inferences.


Fig. 3. Illustration of using the Intel NCS to develop for a DNN based embedded
system (stages shown: training the DNN model, compiling, prototyping)


4 Evaluation 


This section looks at the running of the emotion recognition program and
delivers the results in terms of processing time, with the added measure of how
much power is used to deliver the results. The Raspberry Pi 3 is compared
against two other devices running the same application, both running Ubuntu
16.04. The first is an Alienware 15 laptop with an Intel i7-6820HK quad-core
CPU at 2.7 GHz, 16 GB of RAM and a GTX 980M GPU. The second is a desktop PC
with an AMD Ryzen 1700X octa-core CPU at 3.4 GHz, 32 GB of RAM and a GTX 1080
Ti.


4.1 Real Time Running on Pi 


The actual system with all devices is illustrated in Fig. 1. Figure 4 shows the
runtimes averaged over 10 runs of the code, each running for 300 frames. The
times for the non-deep learning parts are combined and averaged so as not to
influence the overall results, as only minor fluctuations appear between runs.
Figure 4 breaks down the timing into sections:


- Camera Read: the time taken to read an image from the camera

- Image Show: the time taken to display the image on the touchscreen with the
emotion emoji

- Haar Face Detection: the time taken for the detector to crop the image around
a found face

- Inference Runtime: the time taken for the CNN model to run

- Loop Runtime: the total time taken to process one image of the video capture


The graph highlights which parts of the process are slow, especially on an
embedded device, with the time axis given on a log scale so that the differing
magnitudes of time taken by the different tasks can be shown together. Only one
entry differs from the previously mentioned arrangement: R-Pi (opt), the
optimised version that allows the system to run in true real time (sub 33 ms)
while still displaying the camera feed to the user. The modifications were to
run the Haar face detector only on every 3rd frame, and to print the emotion to
the terminal instead of adding the emoji image to the image displayed on the
device. These resulted in savings of 20 and 12 ms respectively. Therefore,
while the Pi with NCS can run at 14 fps (66 ms), the optimised code can run at
30 fps (33 ms). Both can be classed as real-time, though the latter is
preferred for perceiving smooth motion on the video feed. Meanwhile, the laptop
and PC can output 142-167 fps (7 and 6 ms respectively), though to do so they
consume considerably more power, which is typically an undesired trait for an
embedded device.
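
The first optimisation, running the face detector only on every 3rd frame, can
be sketched as follows. The detector and classifier here are stand-in
functions. Reusing the last bounding box on the skipped frames is an assumption
about how the intermediate frames are handled, which the text does not specify.

```python
def run_loop(frames, detect_face, classify, detect_every=3):
    """Detect on every `detect_every`-th frame and reuse the last box in between."""
    last_box = None
    results = []
    for i, frame in enumerate(frames):
        if i % detect_every == 0:
            last_box = detect_face(frame)  # expensive step, run sparsely
        if last_box is None:
            results.append(None)           # no face found yet
            continue
        results.append(classify(frame, last_box))
    return results


# Stand-ins for the Haar detector and the CNN inference call.
calls = {"detect": 0}


def fake_detect(frame):
    calls["detect"] += 1
    return (0, 0, 48, 48)


def fake_classify(frame, box):
    return "happy"


emotions = run_loop(range(9), fake_detect, fake_classify)
```

For 9 frames the detector runs 3 times while a classification is still produced
for every frame.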


4.2 Benchmarking 


Figure 4 shows a significant speed-up for the other processes in the
application, with the GPUs running the TensorFlow models in the fastest time,
as expected. A better comparison, though, is how the systems perform in terms
of inferences per second per watt, an important factor when a minimal SWaP
profile is required. Table 1 shows the results for the 6 system types, with the
new value termed RP (the Run-Power coefficient, in inferences/second/Watt).

The results of the final experiment show that the Intel NCS, given its low
power usage of 1.2 W, rates highly under these SWaP constraints, which vary per
application in many of the DL use cases of Robotics, Human-Computer
Interaction, Healthcare Applications and several other Autonomous Systems.
Given size and weight, or even cost, as extra parameters, the NCS would rate
even higher than shown.


Fig. 4. Chart of results showing each device's running time for the given section


Table 1. Results of power related inferences.


System           | Time (ms)    | Power (W) | RP (inf/s/W)
R-Pi (+NCS)      | 45.26 (7.57) | 5 (6.2)   | 4.42 (21.31)
Intel CPU (+GPU) | 1.92 (1.04)  | 45 (167)  | 11.57 (5.76)
AMD CPU (+GPU)   | 1.30 (0.73)  | 95 (345)  | 8.10 (3.97)
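
The RP values in Table 1 follow directly from the time and power columns. The
sketch below recomputes them, reading the values in parentheses as the
configuration with the accelerator (NCS or GPU) attached. The function and
dictionary names are illustrative.

```python
def run_power(time_ms, power_w):
    """Inferences per second per watt for one inference taking `time_ms`."""
    return (1000.0 / time_ms) / power_w


# (time in ms, power in W) taken from Table 1.
systems = {
    "R-Pi": (45.26, 5),
    "R-Pi + NCS": (7.57, 6.2),
    "Intel CPU": (1.92, 45),
    "Intel CPU + GPU": (1.04, 167),
    "AMD CPU": (1.30, 95),
    "AMD CPU + GPU": (0.73, 345),
}
rp = {name: round(run_power(t, p), 2) for name, (t, p) in systems.items()}
```

The Raspberry Pi with the NCS is the slowest accelerated system in absolute
terms but has the highest RP, since it draws far less power than either GPU
configuration.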


5 Conclusion 


This paper presented a novel design concept that shows how the Intel NCS device
can help to bring state-of-the-art DL to low powered edge devices. The
combination of Raspberry Pi and NCS demonstrated the potential of these devices
to carry out complex image processing in real time. Similar to the Nvidia
Jetson, the Intel NCS can be applied to almost any DL research area. This shows
the ability of this low-cost inference model runner to bridge the gap between
current edge devices and desktop PCs for DL applications. With growing research
into low powered embedded intelligence devices, this paper highlights the
usefulness of this type of device.


Abstract. We present a method to replace the fully-connected layers of
a Convolutional Neural Network (CNN) with a small set of rules, allowing
for better interpretation of its decisions while preserving accuracy.


Keywords: Convolutional neural network · Deep-learning ·
Rule extraction · Random forests · Interpretability


1 Introduction 


Convolutional neural networks (CNNs) perform extremely well in many visual 
classification and object detection tasks. However, interpreting neural networks 
is still a challenging task and many studies propose to visualize, analyze, or 
label the feature representations hidden in the internal layers. Such studies seek
to obtain insights about the process that happens inside a neural network
when it classifies an image. Extracting simple IF-ELSE rules has been tackled
in previous works, for instance by plugging deep Neural Decision Forests on a 
CNN [3] and more recently by interpreting CNNs with decision trees [7]. Herein 
we present a method to replace part of a neural network with an interpretable 
algorithm while preserving a similar level of accuracy. 


2 Methodology 


In our methodology, see Fig. 1, we: (1) use a trained CNN to extract features
from a set of images, (2) train a Random Forest (RF) [1] to create a set of rules
based on these features, and (3) rank the rules according to their utility, i.e., how
much they contribute to the prediction, by applying a form of preference learning
[5]. An analyst can then select the top-N rules allowing for an interpretation.


Fig. 1. Schematic summary of the proposed methodology


First, we present a dataset of 30,000 ImageNet [4] images covering 26 different
classes to a pretrained VGG-16 CNN [6] and discard all the images classified
wrongly. We then consider the average activations of the 512 features for all the
images of a target class, and complete them with the same number of images from
the other classes. This is the training set for an RF classifier that reproduces
the behavior of the fully-connected layers of the CNN. Each root-to-leaf path in
the trees of the forest corresponds to a rule. These rules are ranked, according
to their predictive accuracy, by a simple perceptron. The perceptron is trained
to classify the images by weighting the rule outputs, while minimizing a small ℓ1
penalty to mitigate rule correlations. The rules with the largest weights are kept
as the top rules for the target class and may then be used as an approximation of
the CNN for the class of interest, classifying any input image by majority vote.
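
The final step, keeping the top-weighted rules and classifying by majority
vote, can be sketched as below. The rule encoding, the thresholds and the
weights are hypothetical. Each rule is modelled as a conjunction of threshold
tests that outputs 1 when it fires for the target class, which is an assumption
about the rule outputs rather than a description of the published code.

```python
def make_rule(conditions):
    """Build a rule from (feature_index, op, threshold) tests, i.e. one root-to-leaf path."""
    def rule(features):
        return int(all(
            features[i] > t if op == ">" else features[i] <= t
            for i, op, t in conditions
        ))
    return rule


def top_rules(rules, weights, n):
    """Keep the n rules with the largest perceptron weights."""
    ranked = sorted(zip(weights, range(len(rules))), reverse=True)
    return [rules[i] for _, i in ranked[:n]]


def predict(rules, features):
    """Majority vote: 1 = target class, 0 = any other class."""
    votes = sum(rule(features) for rule in rules)
    return int(2 * votes > len(rules))


# Hypothetical rules over average filter activations, with made-up weights.
rules = [
    make_rule([(0, ">", 0.5), (3, "<=", 0.2)]),
    make_rule([(1, ">", 0.1)]),
    make_rule([(2, ">", 0.9)]),
    make_rule([(0, ">", 0.0)]),
]
weights = [0.9, 0.4, 0.05, 0.7]
selected = top_rules(rules, weights, 3)
is_target = predict(selected, [0.8, 0.3, 0.0, 0.1])
is_other = predict(selected, [0.2, 0.0, 1.0, 0.5])
```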


3 Results and Discussion 


Figure 2a shows that the accuracy of the selected rules quickly reaches an
acceptable threshold and that only 3 rules are often enough for 85-95% accuracy.
The rule shown in Fig. 2b discriminates a class based on two relevant features.
The most characteristic images of the class are grouped together as they have
similar filter activations. We can also get the intuition for visual patterns that
are present in one class but absent in all other classes.


Fig. 2. (a) Classification accuracy (on test set) per number of top rules selected. (b)
Top rule for the 'Great Grey Owl' class with example images. Images are embedded
according to their average filter activation.

These results show that it is possible to reduce the complexity of the CNN 
(fully-connected layers) to a small set of relevant rules, without a great loss in 
accuracy. Furthermore, these rules can be interpreted by looking at how they 
split the input data. Such a combined analysis helps us to better understand the 
global behavior of the network. 

Further results can be found in our GitHub repository [2].


Abstract. Multi-object tracking is a challenge in intelligent video analytics
(IVA) due to possible crowd occlusions and truncations. Learning discriminant
appearance features can alleviate these problems. An online multi-object
tracking method with global-local appearance features is thus proposed in this
paper. It consists of a pedestrian detection with pose estimation, a global-local
convolutional neural network (GLCNN), and a spatio-temporal association
model. The pedestrian detection with pose estimation explicitly leverages pose
cues to reduce incorrect detections. GLCNN extracts discriminative appearance
representations to identify the tracking objects, which implicitly alleviates the
occlusions and truncations by integrating local appearance features. The
spatio-temporal association model incorporates orientation, position, area, and
appearance features of the detections to generate complete trajectories.
Extensive experimental results demonstrate that our proposed method significantly
outperforms many state-of-the-art online tracking approaches on the popular MOT
challenge benchmark.


Keywords: Multi-object tracking · Pose estimation · Global-local features ·
Spatial-temporal association


1 Our Method 


Online multi-object tracking is a popular topic in computer vision [1-3]. It
concentrates on identifying object identities in each incoming frame and
obtaining multiple complete trajectories from a single camera. It has recently
attracted increasing attention owing to advances in detection based on deep
learning [4, 5]. Many traditional methods [6, 7] have been revisited and have
achieved promising performance. Meanwhile, several methods [8-10] based on deep
learning networks have been proposed to improve multi-object tracking. However,
occlusions or truncations often result in incorrect detections and inconsistent
appearance, which significantly decrease the performance of multi-object
tracking algorithms. To address these problems, this paper proposes an online
multi-object tracking method that exploits pose estimation and global-local
appearance features. It consists of pedestrian detection with pose estimation,
global-local appearance feature extraction, and a spatio-temporal association
model.

Pedestrian detection with pose estimation. The two-stage detector uses an
improved Faster RCNN [11] as its basic framework and incorporates pose
estimation to reduce lost objects and incorrect detections. In the first phase,
we replace the VGG-16 with ResNet50 as the convolutional module and adopt
five-scale anchors instead of feature pyramids. In the second phase, we change
the four fixed scales to an adaptive size for the pose estimation inputs.

Global-local appearance feature extraction. A global-local convolutional neural
network (GLCNN) is designed to extract discriminative appearance
representations. It integrates two kinds of global features from unshared
branches and three local features of different body parts. The first main
branch only extracts global features, while the second one is responsible for
extracting head, torso, legs, and another set of global features. The obtained
global feature vectors are merged into the appearance representation by
concatenation.

Spatio-temporal association model. Common spatio-temporal features include the
IOU between two bounding boxes, the position of a person's feet, and so on. In
dense crowds or under occlusions, such spatio-temporal features often increase
the number of identity switches and fragmented trajectories. To avoid these
problems, this paper chooses the orientation, central position, and area of the
bounding boxes as the spatio-temporal features. Once the appearance features
and spatio-temporal features are obtained, the multi-object trajectories are
generated by measuring the appearance feature similarity and the
spatio-temporal correlativity of pairwise detections.
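
To make the association step concrete, the sketch below scores track-detection
pairs by combining appearance similarity with position and area agreement, and
then matches them greedily. The affinity formula, the weight `w_app`, the
threshold and the greedy matching are all hypothetical choices for
illustration. Orientation is omitted for brevity, and the paper's actual
association model may differ.

```python
import math


def cosine(a, b):
    """Cosine similarity between two appearance feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def affinity(track, det, w_app=0.5):
    """Hypothetical affinity: weighted sum of appearance and spatial agreement."""
    appearance = cosine(track["feat"], det["feat"])
    (tx, ty), (dx, dy) = track["center"], det["center"]
    # Centre distance, normalised by the box size so it is scale independent.
    position = math.exp(-math.hypot(dx - tx, dy - ty) / math.sqrt(track["area"]))
    area = min(track["area"], det["area"]) / max(track["area"], det["area"])
    return w_app * appearance + (1.0 - w_app) * position * area


def associate(tracks, dets, threshold=0.5):
    """Greedy one-to-one matching in order of decreasing affinity."""
    scored = sorted(
        ((affinity(t, d), ti, di)
         for ti, t in enumerate(tracks) for di, d in enumerate(dets)),
        reverse=True,
    )
    used_tracks, used_dets, matches = set(), set(), []
    for score, ti, di in scored:
        if score < threshold:
            break
        if ti in used_tracks or di in used_dets:
            continue
        used_tracks.add(ti)
        used_dets.add(di)
        matches.append((ti, di))
    return sorted(matches)


tracks = [
    {"feat": [1.0, 0.0], "center": (10, 10), "area": 100},
    {"feat": [0.0, 1.0], "center": (100, 100), "area": 100},
]
dets = [
    {"feat": [0.0, 1.0], "center": (102, 100), "area": 110},
    {"feat": [1.0, 0.0], "center": (11, 10), "area": 90},
]
matches = associate(tracks, dets)
```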

We show some qualitative results from static and dynamic cameras in Fig. 1, in
which the differently colored bounding boxes with solid lines indicate the
tracked objects. In addition, we demonstrate the effectiveness of the proposed
online multi-object tracking method, referred to as FMOT, on the MOT benchmark
[12]. The evaluation results illustrate that our approach is superior to many
state-of-the-art methods.


Fig. 1. Visual Results on the MOT benchmark. (Best viewed in color)